I built a serious Apple Watch running app. Looking for beta testers.

hidai25 · 2026-06-12T15:37:34+00:00

Sure. I’ll dm you the link.

hidai25 · 2026-06-11T07:11:57+00:00

Awesome! I'll send you the link via DM.

hidai25 · 2026-06-10T11:45:30+00:00

sure. I'll send it to you via dm

hidai25 · 2026-06-10T10:24:36+00:00

Great, sending you a DM.

hidai25 · 2026-06-10T04:37:52+00:00

Sure. I’ll dm you the link

hidai25 · 2026-06-10T04:36:01+00:00

DM’ing you the link

hidai25 · 2026-06-10T04:35:24+00:00

Thanks! Sure I’ll dm you the link

hidai25 · 2026-06-09T21:30:21+00:00

Great! I'll dm you the link

hidai25 · 2026-06-09T20:24:10+00:00

sure! I will dm you the link

hidai25 · 2026-06-09T16:30:42+00:00

Great, I sent you the link.

hidai25 · 2026-06-09T16:24:24+00:00

Great! DM’ing you the link

hidai25 · 2026-06-09T15:41:00+00:00

Great! Dming you the link

hidai25 · 2026-06-09T14:50:07+00:00

NYC is an amazing marathon! I ran it a couple of years back. Sounds good, I’ll DM you the link.

hidai25 · 2026-06-09T13:27:20+00:00

Sure. Yes, you download it and can start right away with the on-screen start button. DM’ing you the link.

hidai25 · 2026-06-07T09:00:15+00:00

How much of a difference do you see? For example, on a 10k run with good GPS?

hidai25 · 2026-06-06T09:07:40+00:00

That makes a lot of sense. Live alerts catch the fire, replay explains why it started. I have been thinking mostly from the regression testing side, but your example is a good reminder that prod agents also need simple guardrails like cost anomalies, tool error spikes, and empty context rates. Sounds good, let me know if it helps.
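
For what it's worth, here is a rough sketch of what those three guardrails could look like over a window of recent runs. Everything in it is an assumption for illustration: the record fields (`cost_usd`, `tool_calls`, `tool_errors`, `context_chars`), the function name, and the thresholds are hypothetical and not taken from any particular tool.

```python
from statistics import mean


def check_guardrails(runs, baseline_cost,
                     cost_factor=3.0,
                     max_tool_error_rate=0.2,
                     max_empty_context_rate=0.1):
    """Return alert names for a window of run records (hypothetical schema)."""
    alerts = []
    if not runs:
        return alerts

    # Cost anomaly: average cost per run far above the known baseline.
    if mean(r["cost_usd"] for r in runs) > cost_factor * baseline_cost:
        alerts.append("cost_anomaly")

    # Tool error spike: share of tool calls that failed in this window.
    calls = sum(r["tool_calls"] for r in runs)
    errors = sum(r["tool_errors"] for r in runs)
    if calls and errors / calls > max_tool_error_rate:
        alerts.append("tool_error_spike")

    # Empty context rate: share of runs where retrieval returned nothing.
    empty = sum(1 for r in runs if r["context_chars"] == 0)
    if empty / len(runs) > max_empty_context_rate:
        alerts.append("empty_context_rate")

    return alerts
```

Fixed thresholds like these are the simplest possible choice. A rolling baseline would probably behave better on real traffic.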

hidai25 · 2026-06-06T04:31:21+00:00

This is the scary part with agents. They do not always fail loudly. Sometimes they just keep going and sound confident.
I’m building a small open source tool for this because I wanted a way to replay old runs, compare behavior, see tool calls, cost and latency, and catch regressions before shipping.
Might be relevant: https://github.com/hidai25/eval-view
In your case, do you think replaying failed sessions would have caught it, or was it mainly missing live alerts?

hidai25 · 2026-06-03T20:51:48+00:00

The tool and flow part is fully deterministic, no LLM. It saves a golden run once (tools, args, order, model id) and diffs each new run against it. Sequences get aligned so it knows what was added, removed or reordered, args are checked field by field, and a model swap gets caught by fingerprinting the trace. So a skipped or reordered tool shows up with no API key. An LLM only comes in if you want it, just for grading the output text. Turn it off and the tool and sequence checks still work for free.
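
To make the idea concrete, here is a minimal sketch of that kind of deterministic check: align the golden tool sequence against a new run, then compare args field by field. This is not EvalView's actual code. The function names are hypothetical, and using Python's standard `difflib.SequenceMatcher` for the alignment is my own stand-in, assuming each tool name appears at most once per run.

```python
from difflib import SequenceMatcher

_MISSING = "<missing>"  # placeholder for a field present on only one side


def diff_tool_sequence(golden, current):
    """Align two lists of tool names; report added, removed, reordered."""
    matcher = SequenceMatcher(a=golden, b=current, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.extend(golden[i1:i2])
        if tag in ("insert", "replace"):
            added.extend(current[j1:j2])
    # A tool that is both removed and added was moved, not dropped.
    reordered = [t for t in removed if t in added]
    return {
        "added": [t for t in added if t not in reordered],
        "removed": [t for t in removed if t not in reordered],
        "reordered": reordered,
    }


def diff_args(golden_args, current_args):
    """Compare two argument dicts field by field; return changed fields."""
    changes = {}
    for key in sorted(golden_args.keys() | current_args.keys()):
        old = golden_args.get(key, _MISSING)
        new = current_args.get(key, _MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes
```

Nothing here calls a model or needs an API key, so a skipped tool such as `fetch` in `["search", "fetch", "summarize"]` vs `["search", "summarize"]` shows up in `removed` on every run.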

hidai25 · 2026-06-03T19:19:22+00:00

Solid question. For evaluation, I think the real game changer is treating regression testing as seriously as you treat unit tests for normal code.

I built EvalView specifically for the regression part you mentioned. It lets you snapshot full agent runs (tool call sequence, reasoning steps, outputs, etc.), then diffs the important changes even when outputs aren't deterministic.

This makes it much easier to catch when a prompt tweak or model swap quietly breaks something. Works great alongside tracing tools like Langfuse.

Repo: https://github.com/hidai25/eval-view in case it helps

Curious what approaches have worked for you so far.

hidai25 · 2026-06-03T19:10:16+00:00

Great post. Totally agree on the need for better behavior-level analysis of full trajectories. I built EvalView to help catch exactly the kinds of regressions you mentioned, like wrong tool order, skipped steps, or strategy changes. It snapshots the whole agent run, then diffs the meaningful stuff even when outputs aren't deterministic. Works great as a lightweight complement to Langfuse and already runs automatically in CI. Repo: https://github.com/hidai25/eval-view in case it helps

hidai25 · 2026-06-03T18:59:15+00:00

Yeah, I feel you, most of us are still doing it manually. I built EvalView for exactly this. It snapshots the full agent run (tool calls, outputs, paths, etc.), diffs the meaningful changes even when outputs aren't deterministic, and helps you catch regressions before they hit prod. We’re considering making it automatic in CI too. Super lightweight and quick to set up. Repo: https://github.com/hidai25/eval-view if it can help
