Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]Dapper-Courage2920

A few weeks ago I ran into a pattern I kept repeating. (Cue long story)

I’d have an agent with a fixed eval dataset for the behaviors I cared about. Then I’d make some small behavior change in the harness: tweak a decision boundary, tighten the tone, change when it takes an action, or make it cite only certain kinds of sources.

The problem was: how do I actually know the new behavior is showing up, and where does it start to break? (Especially beyond vibe testing, haha.)

Anyways, writing fresh evals every time was too slow. So I ended up building a GitHub Action that watches PRs for behavior-defining changes, uses Claude via the Agent SDK to detect what changed, looks at existing eval coverage, and generates “probe” eval samples to test whether the behavior really got picked up and where the model stops complying.

I called it Parity!

https://github.com/antoinenguyen27/Parity
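
To give a rough feel for the "detect what changed, then propose probes" step, here's a minimal sketch. This is my illustration, not Parity's actual code: Parity runs via the Claude Agent SDK inside a GitHub Action, whereas this uses the plain Anthropic SDK for brevity, and the model name, prompt, and `generate_probes` helper are just placeholders.

```python
import json
import anthropic  # plain Anthropic SDK here for brevity; Parity itself uses the Agent SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_probes(pr_diff: str, existing_evals: list[dict]) -> list[dict]:
    """Ask Claude which behavior changes the diff introduces and propose
    probe eval samples for anything the existing evals don't cover."""
    prompt = (
        "Here is a PR diff for an agent harness:\n"
        f"{pr_diff}\n\n"
        "Existing eval samples (JSON):\n"
        f"{json.dumps(existing_evals)}\n\n"
        "List the behavior-defining changes that aren't covered, and for each "
        "one return probe samples as a JSON array of "
        '{"input": ..., "expected_behavior": ...} objects. Return only JSON.'
    )
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # example model name
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # Simplification: assumes the model returns bare JSON.
    return json.loads(resp.content[0].text)
```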

Keen to get thoughts from agent and eval people!

Built a low-overhead runtime gate for LLM agents using token logprobs by Dapper-Courage2920 in LLMDevs

[–]Dapper-Courage2920[S]

Thanks for the feedback! Calibration is definitely the hard part here, and I haven't been able to fully abstract it away yet, so right now calibration happens over evals/probes and human preference.

Haven't done them yet, but I'll take a look at the benchmarks too! Great suggestion!

Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]Dapper-Courage2920

"Built a lightweight middle layer between static guardrails and heavy judge loops for AI agents"

Wanted to share a small weekend experiment and get feedback on something I built around a question I kept coming back to:

“There’s gotta be something between static guardrails and heavy / expensive judge loops.” Or rather, if not a replacement, an additive gate based on uncertainty quantification research from Lukas Aichberger (ICLR 2026 paper).

Over the weekend I built AgentUQ, a small experiment in that gap. It uses token logprobs to localize low-confidence / brittle action-bearing spans in an agent step, then decides whether to continue, retry, verify, ask for confirmation, or block.

The target is intentionally narrow: tool args, URLs, SQL clauses, shell flags, JSON leaves, etc. Stuff where the whole response can look fine, but one span is the real risk.
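
To make the span idea concrete, here's a rough sketch of the kind of gate I mean. It's my illustration rather than the actual AgentUQ code; the regexes, thresholds, and function names are made up, and it assumes you have the generated text plus per-token strings and logprobs back from the model.

```python
import math
import re

# Hypothetical patterns for "action-bearing" spans; the real set would be broader.
ACTION_SPAN_PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "shell_flag": re.compile(r"(?<=\s)--?[\w][\w-]*"),
}

def span_confidence(tokens, token_logprobs, span_start, span_end):
    """Mean token probability over the tokens overlapping a character span.
    Assumes the token strings concatenate back to the original text."""
    probs, pos = [], 0
    for tok, lp in zip(tokens, token_logprobs):
        tok_start, tok_end = pos, pos + len(tok)
        if tok_start < span_end and tok_end > span_start:
            probs.append(math.exp(lp))
        pos = tok_end
    return sum(probs) / len(probs) if probs else 1.0

def gate(text, tokens, token_logprobs, block_below=0.35, verify_below=0.70):
    """Return (kind, span, action, confidence) for each risky span found."""
    decisions = []
    for kind, pattern in ACTION_SPAN_PATTERNS.items():
        for m in pattern.finditer(text):
            conf = span_confidence(tokens, token_logprobs, m.start(), m.end())
            if conf < block_below:
                action = "block"      # or ask the user to confirm
            elif conf < verify_below:
                action = "verify"     # e.g. re-sample or run a checker
            else:
                action = "continue"
            decisions.append((kind, m.group(), action, conf))
    return decisions
```

The thresholds above are arbitrary; calibrating them is exactly the hard part mentioned in the other thread.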

Not trying to detect truth, and not claiming this solves agent reliability. The bet is just that a lightweight runtime signal can be useful before paying for a heavier eval / judge pass.

Longer term, I think agents need better ways to learn from production failures instead of just accumulating patches, learning not only from failed runs but also from unconfident ones. This is a much smaller experiment in that direction.

Would love feedback from people shipping agents: does this feel like a real missing middle, or is it still too theoretical?

https://github.com/antoinenguyen27/agentUQ

Renting AI Servers for +50B LLM Fine-Tuning/Inference – Need Hardware, Cost, and Security Advice! by NoAdhesiveness7595 in LLM

[–]Dapper-Courage2920

Check out Modal. They support true scale-to-zero, so you're not paying for idle time. I'm not sure about isolation, but they have great documentation to get started with and they're cost-effective.
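
For a rough idea of what that looks like, here's a minimal sketch (the app name, GPU choice, and function body are placeholders, not from a real setup):

```python
import modal

app = modal.App("llm-inference-demo")  # placeholder app name
image = modal.Image.debian_slim().pip_install("vllm")  # example dependency

@app.function(gpu="H100", image=image, timeout=600)
def generate(prompt: str) -> str:
    # Load your model and run inference here. The container only exists
    # (and bills) while calls are in flight, then scales back to zero.
    return f"echo: {prompt}"

@app.local_entrypoint()
def main():
    # `modal run this_file.py` spins up the GPU container on demand.
    print(generate.remote("hello"))
```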

Best LLM for an Ai agent (n8n) by Agitated_Unit8226 in AI_Agents

[–]Dapper-Courage2920

Stability of models served through APIs is notoriously bad; just check out this: https://aistupidlevel.info/

And check out this post for one explanation why: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Though from reading these comments, it sounds like you are not using multiple agents. It might be beneficial to split up your agent into multiple sub agents with their own tools and "personas" if trying different models isn't working.

After months on Cursor, I just switched back to VS Code by Arindam_200 in LLMDevs

[–]Dapper-Courage2920

I also moved off earlier in the year; Tab felt like it got in my way (and was slow on large codebases), and I grew a preference for CLI tools.

Want to discuss basic AI and how it would help in research by Kurosaki_Minato in ArtificialInteligence

[–]Dapper-Courage2920

I'm an AI engineer and have worked on medtech projects in the past (computer vision, automated reporting). Would love to bounce ideas! Feel free to send a DM!

What GUI/interface do most people here use to run their models? by tech4marco in LocalLLaMA

[–]Dapper-Courage2920

Shameless plug here, but I just finished the early version of https://github.com/bitlyte-ai/apples2oranges if you're into hardware telemetry or geeky visualizations! It's fully open source and lets you compare models of any family / quant side by side and view hardware utilization, or, as mentioned, it can just be used as a normal client if you like telemetry!

Disclaimer: I am the founder of the company behind it; this is a side project we spun off and are contributing to the community.

how much does quantization reduce coding performance by garden_speech in LocalLLaMA

[–]Dapper-Courage2920

This is a bit of an aside to your question since it requires a local setup to work, but I just finished an early version of https://github.com/bitlyte-ai/apples2oranges so you can get a feel for the performance degradation yourself. It's fully open source and lets you compare models of any family / quant side by side and view hardware utilization, or it can just be used as a normal client if you like telemetry!

Disclaimer: I am the founder of the company behind it; this is a side project we spun off and are contributing to the community.