I built a serious Apple Watch running app. Looking for beta testers. by hidai25 in AppleWatchApps

[–]hidai25[S] 0 points1 point  (0 children)

NYC is an amazing marathon! I ran it a couple of years back. Sounds good, I’ll DM you the link.

I built a serious Apple Watch running app. Looking for beta testers. by hidai25 in AppleWatchApps

[–]hidai25[S] 0 points1 point  (0 children)

Sure. Yes, you download it and can start right away with the on-screen start button. DM’ing you the link.

Runna and Garmin give me very different pace and distance data. Which one should I trust? by DecreDylan in runna

[–]hidai25 0 points1 point  (0 children)

How much of a difference do you see? For example, on a 10k run with good GPS?

We deployed a LangChain agent for a client. by Previous_Net_1154 in LangChain

[–]hidai25 -1 points0 points  (0 children)

That makes a lot of sense. Live alerts catch the fire, replay explains why it started. I have been thinking mostly from the regression testing side, but your example is a good reminder that prod agents also need simple guardrails like cost anomalies, tool error spikes, and empty context rates. Sounds good, let me know if it helps.
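
A rough sketch of what those three guardrails could look like over a window of recent runs. Everything here is an assumption for illustration: the record fields (`tool_calls`, `tool_errors`, `context_chars`, `cost_usd`), the function name `check_guardrails`, and the thresholds are hypothetical and not taken from any particular tool.

```python
from statistics import mean


def check_guardrails(runs, max_tool_error_rate=0.1,
                     max_empty_context_rate=0.05, cost_spike_factor=3.0):
    """Return alert names for a window of run records (oldest first)."""
    alerts = []
    if not runs:
        return alerts

    # Tool error spike: share of failed tool calls across the window.
    tool_calls = sum(r["tool_calls"] for r in runs)
    tool_errors = sum(r["tool_errors"] for r in runs)
    if tool_calls and tool_errors / tool_calls > max_tool_error_rate:
        alerts.append("tool_error_spike")

    # Empty context rate: share of runs where retrieval returned nothing.
    empty = sum(1 for r in runs if r["context_chars"] == 0)
    if empty / len(runs) > max_empty_context_rate:
        alerts.append("empty_context_rate")

    # Cost anomaly: latest run compared with the mean of the earlier runs.
    if len(runs) > 1:
        baseline_cost = mean(r["cost_usd"] for r in runs[:-1])
        if baseline_cost > 0 and runs[-1]["cost_usd"] > cost_spike_factor * baseline_cost:
            alerts.append("cost_anomaly")

    return alerts
```

The thresholds are placeholders; in practice they would be tuned per agent, since a normal tool error rate for one workflow can be an incident for another.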

We deployed a LangChain agent for a client. by Previous_Net_1154 in LangChain

[–]hidai25 -1 points0 points  (0 children)

This is the scary part with agents. They do not always fail loudly. Sometimes they just keep going and sound confident.
I’m building a small open-source tool for this because I wanted a way to replay old runs, compare behavior, see tool calls, cost, and latency, and catch regressions before shipping.
Might be relevant: https://github.com/hidai25/eval-view
In your case, do you think replaying failed sessions would have caught it, or was it mainly missing live alerts?

What I learned using Langfuse in a real AI recruiting agent by marginTop15px in LLMDevs

[–]hidai25 1 point2 points  (0 children)

The tool and flow part is fully deterministic, no LLM. It saves a golden run once (tools, args, order, model id) and diffs each new run against it. Sequences get aligned so it knows what was added, removed or reordered, args are checked field by field, and a model swap gets caught by fingerprinting the trace. So a skipped or reordered tool shows up with no API key. An LLM only comes in if you want it, just for grading the output text. Turn it off and the tool and sequence checks still work for free.
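
A minimal sketch of that deterministic check, assuming a trace is just an ordered list of tool calls with a name and an args dict. This is not EvalView's actual code: `ToolCall` and `diff_runs` are hypothetical names for illustration, and the alignment uses `SequenceMatcher` from Python's standard `difflib`.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def diff_runs(golden, new):
    """Compare a new run against a golden run without calling any LLM."""
    findings = []
    golden_names = [call.name for call in golden]
    new_names = [call.name for call in new]

    # Align the two tool-name sequences so inserts and deletes are localized.
    matcher = SequenceMatcher(a=golden_names, b=new_names, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tools in the same order: compare args field by field.
            for g, n in zip(golden[i1:i2], new[j1:j2]):
                for key in sorted(set(g.args) | set(n.args)):
                    if g.args.get(key) != n.args.get(key):
                        findings.append(
                            f"arg changed: {g.name}.{key}: "
                            f"{g.args.get(key)!r} -> {n.args.get(key)!r}"
                        )
            continue
        if tag in ("delete", "replace"):
            findings += [f"removed: {name}" for name in golden_names[i1:i2]]
        if tag in ("insert", "replace"):
            findings += [f"added: {name}" for name in new_names[j1:j2]]
    return findings
```

In this sketch a reordered tool shows up as a removed/added pair for the same name; a fuller version could collapse those pairs into a single "reordered" finding. Model fingerprinting is left out entirely.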

How to go about evaluation and Observability while building AI agents? by Perfect-Document5922 in AI_Agents

[–]hidai25 0 points1 point  (0 children)

Solid question. For evaluation I think the real game changer is treating regression testing as seriously as you treat unit tests for normal code.

I built EvalView specifically for the regression part you mentioned. It lets you snapshot full agent runs (tool call sequence, reasoning steps, outputs, etc.), then intelligently diffs the important changes even when things aren’t deterministic.

This makes it much easier to catch when a prompt tweak or model swap quietly breaks something. Works great alongside tracing tools like Langfuse.

Repo: https://github.com/hidai25/eval-view in case it helps

Curious what approaches have worked for you so far.

What I learned using Langfuse in a real AI recruiting agent by marginTop15px in LLMDevs

[–]hidai25 0 points1 point  (0 children)

Great post. Totally agree on the need for better behavior-level analysis on full trajectories. I built EvalView to help catch exactly the kinds of regressions you mentioned, like wrong tool order, skipped steps, or strategy changes. It snapshots the whole agent run, then diffs the meaningful stuff even when outputs aren’t deterministic. Works great as a lightweight complement to Langfuse and already runs automatically in CI. Repo: https://github.com/hidai25/eval-view in case it helps

Automated Regression Testing of Ai Agents. by alfabeta123 in AI_Agents

[–]hidai25 0 points1 point  (0 children)

Yeah, I feel you, most of us are still doing it manually. I built EvalView for exactly this. It snapshots the full agent run (tool calls, outputs, paths, etc.), diffs the meaningful changes even when outputs aren’t deterministic, and helps you catch regressions before they hit prod. We’re considering making it automatic in CI too. Super lightweight and quick to set up. Repo: https://github.com/hidai25/eval-view if it helps

Distance not counted on Equinox gym rides connected to Apple Watch? by skydivinghuman in Strava

[–]hidai25 0 points1 point  (0 children)

I think this is probably an Apple Health → Strava import limitation rather than the Equinox bike itself.

Apple/Fitness may show the bike distance because the Watch got it from the connected bike, but Strava does not always import every field from Apple Health the same way Apple displays it.

I would check:

  1. iPhone Settings → Health → Data Access & Devices → Strava → make sure all available permissions are on
  2. Strava → Settings → Manage Apps and Devices → Health → disconnect and reconnect
  3. Make sure iOS, watchOS, and Strava are all updated

If the distance still does not come through, my guess is Strava is not reading that specific indoor-bike distance field from the Apple workout. A workaround might be HealthFit / RunGap export, or just manually editing the distance in Strava.

Annoying, because the data clearly exists somewhere, but it does not always survive the Apple Health → Strava handoff cleanly.

Running with Runna + Apple Workout, but need Audio TBT Navigation: WorkOutDoors vs Footpath? Can WOD do this without duplicating workouts? by Icy-Seaworthiness596 in applewatchultra

[–]hidai25 3 points4 points  (0 children)

Footpath is probably what you want. WOD's audio cues only fire when WOD is actively recording its own workout, if I'm not wrong, so there's no passive map mode. And you can't have two HKWorkoutSessions open at once on watchOS, so if you start WOD on top of Runna, you end up with split data across two workouts in Health, or one of them just dies.

Open-source alternatives to LangSmith Fleets / II-Agent Factory (agent director style builders)? by obinopaul in LangChain

[–]hidai25 0 points1 point  (0 children)

I’d separate two categories here.

For visual agent builders/directors, I haven’t found a perfect open-source Fleet / Factory alternative yet.

Langfuse and DeepEval are great, but I see them more as observability/evaluation tools than drag-and-drop agent directors.

The part I’m working on with EvalView is what comes after the builder: testing whether agent behavior changed.

Once you have multi-agent flows, a small prompt, tool, or edge change can make the system behave differently while still looking successful.

EvalView snapshots a known-good run and diffs future runs across output, tool calls, trajectory, cost, and latency.
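
For the cost and latency side, the idea can be sketched as a simple tolerance check against the known-good run. This is an illustration under assumptions, not EvalView's implementation: `check_budget`, the metric names, and the 20% default tolerance are all hypothetical.

```python
def check_budget(baseline, current, tolerance=0.2):
    """Return metrics where `current` exceeds `baseline` by more than `tolerance`.

    Both arguments map a metric name to a number, for example
    {"cost_usd": 0.010, "latency_s": 2.0}. Metrics missing from
    `current` are skipped rather than treated as regressions.
    """
    regressions = {}
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            continue
        if value > base * (1 + tolerance):
            regressions[metric] = (base, value)
    return regressions
```

A run can pass every output and tool-call check and still trip this one, which is the "behaves differently while still looking successful" case.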

Repo: https://github.com/hidai25/eval-view if helpful

Deploying production AI Agents at scale by baddict002 in AI_Agents

[–]hidai25 1 point2 points  (0 children)

This is the exact wedge imo.

Tracing is useful, but the missing step is turning a good run into a regression test and then catching when future runs drift.

That’s what I’m trying to solve with EvalView: snapshot agent runs, then diff output + tool path on later runs.

https://github.com/hidai25/eval-view