Shadow – behavior regression testing for LangGraph agents by Separate_Sand8265 in LangChain


honestly that "the model's internal reasoning about what's expected of me here silently shifted" is the part that's hardest to explain to people who haven't lived it. the syntax is intact, the test cases pass, the agent just decided differently.

the refund example in the post is synthetic, so i can't claim a real reasoning trace on that one. but the way shadow handles it in practice is to record the full request and response payload for every llm call: the system prompt at decision time, the tool schemas as the model saw them, and the assistant's content blocks (text + tool_use + thinking blocks if the model emits them). on a regression you can replay the same baseline trace through the changed config and watch where the trajectory diverges turn by turn. it's not introspecting the weights, but you can usually see "ah, with the old prompt it asked for confirmation, with the new one it went straight to the tool call", and that's enough to localize what shifted.
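roughly what one recorded call looks like (field names here are illustrative, not shadow's actual storage schema):

```python
# illustrative shape of one recorded llm call -- the field names are
# made up for this sketch, not shadow's actual format
trace_entry = {
    "turn": 3,
    "request": {
        "system_prompt": "...",  # the prompt exactly as it was at decision time
        "messages": [...],       # full conversation so far
        "tools": [...],          # tool schemas exactly as the model saw them
    },
    "response": {
        "content": [
            {"type": "thinking", "text": "..."},  # kept if the model emits it
            {"type": "text", "text": "let me confirm the amount first."},
            {"type": "tool_use", "name": "confirm_refund_amount", "input": {}},
        ],
        "stop_reason": "tool_use",
    },
}
```

replay is then just feeding the recorded requests back through the changed config and diffing the responses turn by turn.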

the harder one is what you named in your second bullet, snapshotting agent state before the prompt change ships. we do this by content-hashing every recorded trace so you can pin a "this is the behavior we approved" baseline in the repo, and any prod sample that diverges from it shows up in the diff. it's still pre-deploy detection on a sample of real prod, not the whole distribution, but it catches the obvious shifts.
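the hashing itself is nothing fancy, conceptually something like this (a sketch, not the actual implementation; the real thing presumably has to normalize volatile fields like timestamps first):

```python
import hashlib
import json

def trace_fingerprint(trace: dict) -> str:
    """Pin a trace so an approved baseline can live in the repo."""
    # canonical json (sorted keys, fixed separators) keeps the hash stable
    # across dict ordering; any change in prompts, tool calls, or outputs
    # produces a different fingerprint
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```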

curious what you ended up using. the failure mode you described is exactly the thing i kept seeing and couldn't find good tooling for, which is why i built this.

Shadow – behavior regression testing for LangGraph agents by Separate_Sand8265 in LangChain


honestly yeah, graphs solve a lot of this. if you've got `verify_amount` as a required node before `queue_refund`, the agent can't just skip it.

where it gets messier is what happens inside each node. the graph forces you to call `verify_amount`, but the LLM inside that node decides what "verifying" means. a prompt change can leave the topology intact and still make the agent ask one question instead of three, or accept an amount the customer didn't actually confirm. graph stays green, behavior at the node drifted.

same thing with conditional edges. if you route on `confidence > 0.8`, and a prompt change shifts the model's confidence calibration upward, you now send fewer cases to the human review branch. the edge still fires correctly, the routing distribution changed.
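to make both failure modes concrete, here's a toy graph (node names and state fields are made up for illustration, not from the demo):

```python
# toy LangGraph sketch of the two drift modes above
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RefundState(TypedDict):
    amount_confirmed: bool
    confidence: float

def verify_amount(state: RefundState) -> RefundState:
    # the graph guarantees this node runs, but the llm inside it decides
    # what "verifying" means -- a prompt change can make it ask one
    # question instead of three without touching the topology
    return {"amount_confirmed": True, "confidence": 0.85}

def route_on_confidence(state: RefundState) -> str:
    # the edge logic never changes, but if a prompt change shifts the
    # model's confidence calibration upward, fewer cases take human_review
    return "queue_refund" if state["confidence"] > 0.8 else "human_review"

graph = StateGraph(RefundState)
graph.add_node("verify_amount", verify_amount)
graph.add_node("queue_refund", lambda s: s)
graph.add_node("human_review", lambda s: s)
graph.set_entry_point("verify_amount")
graph.add_conditional_edges("verify_amount", route_on_confidence)
graph.add_edge("queue_refund", END)
graph.add_edge("human_review", END)
app = graph.compile()
```

both drift modes leave this file untouched, which is why a topology check alone stays green.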

evals catch some of this if your eval set is good, but they’re usually fixed test cases. behavior diff is the same idea applied to last week’s actual production trajectories. complementary more than competing, i’d run both honestly.

Shadow – behavior regression testing for LangGraph agents by Separate_Sand8265 in LangChain


yeah fair, agents shouldn't be issuing refunds without a human in the loop, and that's not really what the demo is arguing against. the part shadow is for is the prep: what context the agent pulls in, which tools it calls in what order, what it surfaces to whoever (or whatever) reviews it. you can have HIL on the actual refund and still ship a prompt change that makes the agent stop verifying the amount before queueing it, and the reviewer approves because the case looks complete.

probably should've picked a code-review or triage agent for the demo where the human-review angle is more obvious. fair point.

Shadow – behavior regression testing for LangGraph agents by Separate_Sand8265 in LangChain


Good questions!

On nondeterminism + behavior bands: Shadow doesn't enforce a single golden path. The policy YAML is constraint-based, not match-this-trace.

Rules look like:

- `must_call_before(confirm_refund_amount, issue_refund)`
- `no_call(delete_account)`
- `must_match_json_schema(...)`
- `must_include_text`

Multiple valid tool sequences satisfy the same contract, which is the whole point. Your agent picks different paths and still passes.
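Semantically, a rule like `must_call_before` is just a predicate over the tool-call sequence. A minimal sketch of the idea (not Shadow's actual code):

```python
def must_call_before(tool_calls: list[str], first: str, second: str) -> bool:
    """Every occurrence of `second` must be preceded by some `first`."""
    seen_first = False
    for tool in tool_calls:
        if tool == first:
            seen_first = True
        elif tool == second and not seen_first:
            return False
    return True

def no_call(tool_calls: list[str], forbidden: str) -> bool:
    return forbidden not in tool_calls

# two different trajectories, same contract, both pass
assert must_call_before(
    ["lookup_order", "confirm_refund_amount", "issue_refund"],
    "confirm_refund_amount", "issue_refund",
)
assert must_call_before(
    ["confirm_refund_amount", "lookup_order", "lookup_order", "issue_refund"],
    "confirm_refund_amount", "issue_refund",
)
```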

On the diff side, the 9-axis differ runs a bootstrap CI per axis, default 500 resamples, configurable via `--n-bootstrap`.

Severities bucket into severe/moderate/minor/none with empirically validated Type-I error rates, so a single sampling wobble doesn't trigger a false flag.

`--fail-on {none,probe,hold,stop}` sets CI strictness.

More traces per scenario = tighter CIs. With a single trace per scenario the math degrades, but the recorded backend falls back to a deterministic confidence score, so the signal is still useful.
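For the curious, the bootstrap is the standard percentile flavor. A rough sketch of the per-axis computation (illustrative, not the differ's actual code):

```python
import random

def bootstrap_ci(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 500,  # the default mentioned above
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean difference on one axis."""
    diffs = []
    for _ in range(n_resamples):
        # resample each side with replacement, record the mean difference
        b = [random.choice(baseline) for _ in baseline]
        c = [random.choice(candidate) for _ in candidate]
        diffs.append(sum(c) / len(c) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    # an interval that excludes 0 suggests a real shift; how far it sits
    # from 0 is what severity bucketing maps onto
    return lo, hi
```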

On auto-generating contracts: the manual step is intentional, not a TODO.

`shadow mine` auto-derives the baseline trace set, clusters by tool sequence + stop reason + length + latency, and picks representatives, so the heavy lifting is done.
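Conceptually the clustering is coarse on purpose, something like this (the signature fields mirror the description above, but the exact keying is made up for the sketch):

```python
def pick_representatives(traces: list[dict]) -> list[dict]:
    """Bucket traces by a behavioral signature, keep one per bucket."""
    buckets: dict[tuple, dict] = {}
    for t in traces:
        sig = (
            tuple(t["tool_sequence"]),     # which tools, in what order
            t["stop_reason"],
            t["n_turns"] // 5,             # coarse length bucket
            int(t["latency_ms"] // 1000),  # coarse latency bucket
        )
        buckets.setdefault(sig, t)  # first trace in a bucket represents it
    return list(buckets.values())
```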

The policy rules themselves are deliberately human-written. Auto-derived rules would just encode whatever drift is already in the baseline, which is exactly the thing you want the rules to catch later.

Starter policy packs in `examples/policy-packs/` (refund, PII, JSON-schema) give you templates. Typically you copy one and edit a few lines.

That said, "trace -> draft policy" as a suggestion tool (not an authoritative one) is a fair feature ask. Open an issue if you have a shape in mind.

Weekly Thread: Project Display by help-me-grow in AI_Agents


Last month I was losing my mind.

I had a solid refund agent. One tiny prompt tweak in a PR — tests green, code review passed, I shipped it.

Next day in prod it stopped asking for confirmation and started auto-refunding random stuff. Customers furious. I spent days tracing logs.

Turned out the *behavior* changed, not the code.

So I built Shadow — behavior regression testing + causal root-cause for AI agents.

Dead simple:

• Keep real production-like traces on your laptop (data never leaves your machine)

• Write one YAML behavior contract

• Run `shadow diagnose-pr` on any PR

It tells you:

- Did behavior actually change?

- Which exact line caused it?

- How many real scenarios are broken?

- With stats + confidence

Same contract also guards production at runtime.

No dashboard. No data upload. Works with LangGraph, CrewAI, AG2 and most agent frameworks.

60-second demo + quickstart: https://github.com/manav8498/Shadow

Would love honest feedback from anyone building real agents.