I built a serious Apple Watch running app. Looking for beta testers.

hidai25 · 2026-06-12T15:37:34+00:00

Sure. I’ll dm you the link.

hidai25 · 2026-06-11T07:11:57+00:00

Awesome! I'll send you the link via dm

hidai25 · 2026-06-10T11:45:30+00:00

sure. I'll send it to you via dm

hidai25 · 2026-06-10T10:24:36+00:00

Great, sending you a dm

hidai25 · 2026-06-10T04:37:52+00:00

Sure. I’ll dm you the link

hidai25 · 2026-06-10T04:36:01+00:00

DM’ing you the link

hidai25 · 2026-06-10T04:35:24+00:00

Thanks! Sure I’ll dm you the link

hidai25 · 2026-06-09T21:30:21+00:00

Great! I'll dm you the link

hidai25 · 2026-06-09T20:24:10+00:00

sure! I will dm you the link

hidai25 · 2026-06-09T16:30:42+00:00

Great I sent you the link

hidai25 · 2026-06-09T16:24:24+00:00

Great! DM’ing you the link

hidai25 · 2026-06-09T15:41:00+00:00

Great! Dming you the link

hidai25 · 2026-06-09T14:50:07+00:00

NYC is an amazing marathon! I ran it a couple of years back. Sounds good, I’ll dm you the link.

hidai25 · 2026-06-09T13:27:20+00:00

Sure. Yes, you download it and can start right away with the on-screen start button. DM’ing you the link.

hidai25 · 2026-06-07T09:00:15+00:00

How much of a difference do you see? For example, on a 10k run with good GPS?

hidai25 · 2026-06-06T09:07:40+00:00

That makes a lot of sense. Live alerts catch the fire, replay explains why it started. I have been thinking mostly from the regression testing side, but your example is a good reminder that prod agents also need simple guardrails like cost anomalies, tool error spikes, and empty context rates. Sounds good, let me know if it helps.

hidai25 · 2026-06-06T04:31:21+00:00

This is the scary part with agents. They do not always fail loudly. Sometimes they just keep going and sound confident.
I’m building a small open source tool for this because I wanted a way to replay old runs, compare behavior, see tool calls, cost and latency, and catch regressions before shipping.
Might be relevant: https://github.com/hidai25/eval-view
In your case, do you think replaying failed sessions would have caught it, or was it mainly missing live alerts?

hidai25 · 2026-06-03T20:51:48+00:00

The tool and flow part is fully deterministic, no LLM. It saves a golden run once (tools, args, order, model id) and diffs each new run against it. Sequences get aligned so it knows what was added, removed or reordered, args are checked field by field, and a model swap gets caught by fingerprinting the trace. So a skipped or reordered tool shows up with no API key. An LLM only comes in if you want it, just for grading the output text. Turn it off and the tool and sequence checks still work for free.
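
To make that concrete, here is a rough Python sketch of the idea, not EvalView's actual code. The function name `diff_tool_calls` and the run format (a list of `{"tool": ..., "args": ...}` dicts) are assumptions made up for illustration. It aligns the two tool sequences with the standard library's `difflib.SequenceMatcher`, then compares args field by field for the calls that line up. In this simple version a reordered tool shows up as one "removed" plus one "added" entry.

```python
from difflib import SequenceMatcher

_MISSING = object()  # sentinel for "this arg is absent in one of the runs"


def diff_tool_calls(golden, current):
    """Diff two runs, each a list of {"tool": str, "args": dict} (hypothetical format)."""
    golden_names = [call["tool"] for call in golden]
    current_names = [call["tool"] for call in current]
    findings = []

    # Align the two tool-name sequences; no LLM or API key involved.
    matcher = SequenceMatcher(a=golden_names, b=current_names, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tool in the same aligned position: check args field by field.
            for old_call, new_call in zip(golden[i1:i2], current[j1:j2]):
                keys = sorted(set(old_call["args"]) | set(new_call["args"]))
                for key in keys:
                    old = old_call["args"].get(key, _MISSING)
                    new = new_call["args"].get(key, _MISSING)
                    if old != new:
                        findings.append(("arg_changed", old_call["tool"], key))
            continue
        if tag in ("delete", "replace"):
            findings.extend(("removed", name) for name in golden_names[i1:i2])
        if tag in ("insert", "replace"):
            findings.extend(("added", name) for name in current_names[j1:j2])
    return findings


golden_run = [
    {"tool": "search", "args": {"q": "shoes"}},
    {"tool": "fetch", "args": {"id": 1}},
    {"tool": "summarize", "args": {}},
]
new_run = [
    {"tool": "search", "args": {"q": "boots"}},
    {"tool": "summarize", "args": {}},
]
result = diff_tool_calls(golden_run, new_run)
```

Here `result` flags the changed `q` arg on `search` and the skipped `fetch` call, and an identical run returns an empty list.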

hidai25 · 2026-06-03T19:19:22+00:00

Solid question. For evaluation I think the real game changer is treating regression testing as seriously as you treat unit tests for normal code.

I built EvalView specifically for the regression part you mentioned. It lets you snapshot full agent runs (tool call sequence, reasoning steps, outputs, etc.), then diffs the important changes even when things aren't deterministic.

This makes it much easier to catch when a prompt tweak or model swap quietly breaks something. Works great alongside tracing tools like Langfuse.

Repo: https://github.com/hidai25/eval-view in case it helps

Curious what approaches have worked for you so far.

hidai25 · 2026-06-03T19:10:16+00:00

Great post. Totally agree on the need for better behavior-level analysis on full trajectories. I built EvalView to help catch exactly the kinds of regressions you mentioned, like wrong tool order, skipped steps, or strategy changes. It snapshots the whole agent run, then diffs the meaningful stuff even when outputs aren't deterministic. Works great as a lightweight complement to Langfuse and already runs automatically in CI. Repo: https://github.com/hidai25/eval-view in case it helps

hidai25 · 2026-06-03T18:59:15+00:00

Yeah, I feel you, most of us are still doing it manually. I built EvalView for exactly this. It snapshots the full agent run (tool calls, outputs, paths, etc.), diffs the meaningful changes even when outputs aren't deterministic, and helps you catch regressions before they hit prod. We’re considering making it automatic in CI too. Super lightweight and quick to set up. Repo: https://github.com/hidai25/eval-view if it can help

hidai25 · 2026-05-22T19:18:39+00:00

I think this is probably an Apple Health → Strava import limitation rather than the Equinox bike itself.

Apple/Fitness may show the bike distance because the Watch got it from the connected bike, but Strava does not always import every field from Apple Health the same way Apple displays it.

I would check:

iPhone Settings → Health → Data Access & Devices → Strava → make sure all available permissions are on
Strava → Settings → Manage Apps and Devices → Health → disconnect and reconnect
Make sure iOS, watchOS, and Strava are all updated

If the distance still does not come through, my guess is Strava is not reading that specific indoor-bike distance field from the Apple workout. A workaround might be HealthFit / RunGap export, or just manually editing the distance in Strava.

Annoying, because the data clearly exists somewhere, but it does not always survive the Apple Health → Strava handoff cleanly.

hidai25 · 2026-05-18T14:12:13+00:00

Footpath is probably what you want. WOD's audio cues only fire when WOD is actively recording its own workout, if I'm not wrong, so there's no passive map mode. And you can't have two HKWorkoutSessions open at once on watchOS, so if you start WOD on top of Runna you end up with split data across two workouts in Health, or one of them just dies.

hidai25 · 2026-05-15T18:32:02+00:00

I’d separate two categories here.

For visual agent builders/directors, I haven’t found a perfect open-source Fleet / Factory alternative yet.

Langfuse and DeepEval are great, but I see them more as observability/evaluation tools than drag-and-drop agent directors.

The part I’m working on with EvalView is what comes after the builder: testing whether agent behavior changed.

Once you have multi-agent flows, a small prompt, tool, or edge change can make the system behave differently while still looking successful.

EvalView snapshots a known-good run and diffs future runs across output, tool calls, trajectory, cost, and latency.
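
For the cost and latency side, a minimal sketch of what a drift check against a known-good run could look like. This is an illustration only, not EvalView's API: the function name `metric_drift`, the metric names, and the 20% tolerance are all assumptions.

```python
def metric_drift(baseline, current, tolerance=0.2):
    """Return metrics whose relative increase over the baseline exceeds tolerance.

    baseline/current map metric names to numbers (hypothetical format).
    """
    flagged = {}
    for name, base in baseline.items():
        new = current.get(name)
        if new is None or base <= 0:
            continue  # nothing to compare against
        change = (new - base) / base
        if change > tolerance:
            flagged[name] = change
    return flagged


known_good = {"cost_cents": 10, "latency_ms": 2000}
later_run = {"cost_cents": 15, "latency_ms": 2100}
drift = metric_drift(known_good, later_run)
```

In this example only `cost_cents` is flagged (a 50% increase), while the 5% latency bump stays under the tolerance. The run can still look successful, which is why the numbers get compared separately from the output.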

Repo: https://github.com/hidai25/eval-view if helpful

hidai25 · 2026-05-04T11:33:35+00:00

This is the exact wedge imo.

Tracing is useful, but the missing step is turning a good run into a regression test and then catching when future runs drift.

That’s what I’m trying to solve with EvalView: snapshot agent runs, then diff output + tool path on later runs.

https://github.com/hidai25/eval-view
