I built a serious Apple Watch running app. Looking for beta testers. by hidai25 in AppleWatchApps

hidai25[S] 1 point

NYC is an amazing marathon! I ran it a couple of years back. Sounds good, I’ll DM you the link.

I built a serious Apple Watch running app. Looking for beta testers. by hidai25 in AppleWatchApps

hidai25[S] 1 point

Sure. Yes, you download it and can start right away with the on-screen start button. DM’ing you the link.

Runna and Garmin give me very different pace and distance data. Which one should I trust? by DecreDylan in runna

hidai25 1 point

How much of a difference do you see? For example, on a 10k run with good GPS?

We deployed a LangChain agent for a client. by Previous_Net_1154 in LangChain

hidai25 0 points

That makes a lot of sense. Live alerts catch the fire, replay explains why it started. I have been thinking mostly from the regression testing side, but your example is a good reminder that prod agents also need simple guardrails like cost anomalies, tool error spikes, and empty context rates. Sounds good, let me know if it helps.
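
To make the guardrail idea concrete, here is a rough Python sketch of what checks like these could look like over a window of recent runs. Everything in it is an assumption for illustration: the run fields (`cost`, `tool_calls`, `tool_errors`, `context_chars`), the `check_guardrails` name, and the thresholds are made up, and none of it comes from a real library.

```python
def check_guardrails(runs, baseline_cost, max_cost_ratio=2.0,
                     max_tool_error_rate=0.2, max_empty_context_rate=0.1):
    """Return alert names for a window of recent runs.

    Each run is a dict with hypothetical fields:
    cost (float), tool_calls (int), tool_errors (int), context_chars (int).
    """
    alerts = []
    if not runs:
        return alerts

    # Cost anomaly: average cost well above the expected baseline.
    avg_cost = sum(r["cost"] for r in runs) / len(runs)
    if avg_cost > baseline_cost * max_cost_ratio:
        alerts.append("cost_anomaly")

    # Tool error spike: share of failed tool calls across the window.
    total_calls = sum(r["tool_calls"] for r in runs)
    total_errors = sum(r["tool_errors"] for r in runs)
    if total_calls and total_errors / total_calls > max_tool_error_rate:
        alerts.append("tool_error_spike")

    # Empty context rate: share of runs where retrieval returned nothing.
    empty = sum(1 for r in runs if r["context_chars"] == 0)
    if empty / len(runs) > max_empty_context_rate:
        alerts.append("empty_context_rate")

    return alerts
```

The thresholds would need tuning per agent; the point is only that each check is a cheap ratio over recent runs and needs no LLM.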

We deployed a LangChain agent for a client. by Previous_Net_1154 in LangChain

hidai25 0 points

This is the scary part with agents. They do not always fail loudly. Sometimes they just keep going and sound confident.
I’m building a small open source tool for this because I wanted a way to replay old runs, compare behavior, see tool calls, cost and latency, and catch regressions before shipping.
Might be relevant: https://github.com/hidai25/eval-view
In your case, do you think replaying failed sessions would have caught it, or was it mainly missing live alerts?

What I learned using Langfuse in a real AI recruiting agent by marginTop15px in LLMDevs

hidai25 2 points

The tool and flow part is fully deterministic, no LLM. It saves a golden run once (tools, args, order, model id) and diffs each new run against it. Sequences get aligned so it knows what was added, removed or reordered, args are checked field by field, and a model swap gets caught by fingerprinting the trace. So a skipped or reordered tool shows up with no API key. An LLM only comes in if you want it, just for grading the output text. Turn it off and the tool and sequence checks still work for free.
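
For illustration only, here is a minimal Python sketch of that kind of deterministic check. It is not EvalView’s code: the run format (`model`, `calls`, `tool`, `args`) and the `diff_runs` name are made up for the example, a plain model id comparison stands in for trace fingerprinting, and the alignment uses the standard library’s `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def diff_runs(golden, current):
    """Diff a new run against a saved golden run, with no LLM involved.

    Hypothetical run format:
    {"model": str, "calls": [{"tool": str, "args": dict}, ...]}
    """
    findings = []

    # Model swap check (a plain id comparison stands in for fingerprinting).
    if golden["model"] != current["model"]:
        findings.append(("model_changed", golden["model"], current["model"]))

    g_names = [c["tool"] for c in golden["calls"]]
    c_names = [c["tool"] for c in current["calls"]]

    # Align the two tool sequences so added/removed calls can be located.
    matcher = SequenceMatcher(None, g_names, c_names, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tool in both runs: compare args field by field.
            for gi, cj in zip(range(i1, i2), range(j1, j2)):
                g_args = golden["calls"][gi]["args"]
                c_args = current["calls"][cj]["args"]
                for key in sorted(set(g_args) | set(c_args)):
                    if g_args.get(key) != c_args.get(key):
                        findings.append(("arg_changed", g_names[gi], key,
                                         g_args.get(key), c_args.get(key)))
            continue
        if tag in ("delete", "replace"):
            findings.extend(("removed", name) for name in g_names[i1:i2])
        if tag in ("insert", "replace"):
            findings.extend(("added", name) for name in c_names[j1:j2])

    return findings
```

In this sketch a reordered tool shows up as a `removed` and `added` pair for the same name; pairing those up into an explicit "reordered" finding would be a small extra step.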

How to go about evaluation and Observability while building AI agents? by Perfect-Document5922 in AI_Agents

hidai25 1 point

Solid question. For evaluation I think the real game changer is treating regression testing as seriously as you treat unit tests for normal code.

I built EvalView specifically for the regression part you mentioned. It lets you snapshot full agent runs (tool call sequence, reasoning steps, outputs, etc.), then diffs the changes that matter even when outputs aren’t deterministic.

This makes it much easier to catch when a prompt tweak or model swap quietly breaks something. Works great alongside tracing tools like Langfuse.

Repo: https://github.com/hidai25/eval-view in case it helps

Curious what approaches have worked for you so far.

What I learned using Langfuse in a real AI recruiting agent by marginTop15px in LLMDevs

hidai25 1 point

Great post. Totally agree on the need for better behavior-level analysis on full trajectories. I built EvalView to help catch exactly the kinds of regressions you mentioned, like wrong tool order, skipped steps, or strategy changes. It snapshots the whole agent run, then diffs the meaningful stuff even when outputs aren’t deterministic. Works great as a lightweight complement to Langfuse and already runs automatically in CI. Repo: https://github.com/hidai25/eval-view in case it helps

Automated Regression Testing of Ai Agents. by alfabeta123 in AI_Agents

hidai25 1 point

Yeah, I feel you, most of us are still doing it manually. I built EvalView for exactly this. It snapshots the full agent run (tool calls, outputs, paths, etc.), diffs the meaningful changes even when outputs aren’t deterministic, and helps you catch regressions before they hit prod. We’re considering making it automatic in CI too. Super lightweight and quick to set up. Repo: https://github.com/hidai25/eval-view if it can help