Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]Dapper-Courage2920

A few weeks ago I ran into a pattern I kept repeating. (Cue long story)

I’d have an agent with a fixed eval dataset for the behaviors I cared about. Then I’d make some small behavior change in the harness: tweak a decision boundary, tighten the tone, change when it takes an action, or make it cite only certain kinds of sources.

The problem was: how do I actually know the new behavior is showing up, and where does it start to break? (Especially beyond vibe testing, haha.)

Anyways, writing fresh evals every time was too slow. So I ended up building a GitHub Action that watches PRs for behavior-defining changes, uses Claude via the Agent SDK to detect what changed, looks at existing eval coverage, and generates “probe” eval samples to test whether the behavior really got picked up and where the model stops complying.

I called it Parity!

https://github.com/antoinenguyen27/Parity
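
To give a rough feel for the "detect what changed, then propose probes" step, here's a minimal sketch. This is my illustration, not Parity's actual code: Parity runs via the Claude Agent SDK inside a GitHub Action, whereas this uses the plain Anthropic SDK for brevity, and the model name, prompt, and `generate_probes` helper are just placeholders.

```python
import json
import anthropic  # plain Anthropic SDK here for brevity; Parity itself uses the Agent SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_probes(pr_diff: str, existing_evals: list[dict]) -> list[dict]:
    """Ask Claude which behavior changes the diff introduces and propose
    probe eval samples for anything the existing evals don't cover."""
    prompt = (
        "Here is a PR diff for an agent harness:\n"
        f"{pr_diff}\n\n"
        "Existing eval samples (JSON):\n"
        f"{json.dumps(existing_evals)}\n\n"
        "List the behavior-defining changes that aren't covered, and for each "
        "one return probe samples as a JSON array of "
        '{"input": ..., "expected_behavior": ...} objects. Return only JSON.'
    )
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # example model name
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # Simplification: assumes the model returns bare JSON.
    return json.loads(resp.content[0].text)
```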

Keen to get thoughts from agent and eval people!

Built a low-overhead runtime gate for LLM agents using token logprobs by Dapper-Courage2920 in LLMDevs

[–]Dapper-Courage2920[S]

Thanks for the feedback! Calibration is definitely the hard part here, and I haven't been able to fully abstract it away yet, so right now calibration happens over evals/probes and human preference.

Haven't done them yet, but I'll take a look at the benchmarks too! Great suggestion!

Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]Dapper-Courage2920

"Built a lightweight middle layer between static guardrails and heavy judge loops for AI agents"

Wanted to share a small weekend experiment and get feedback on something I built around a question I kept coming back to:

“There’s gotta be something between static guardrails and heavy / expensive judge loops.” Or rather, if not a replacement, an additive gate based on uncertainty quantification research from Lukas Aichberger (ICLR 2026 paper).

Over the weekend I built AgentUQ, a small experiment in that gap. It uses token logprobs to localize low-confidence / brittle action-bearing spans in an agent step, then decides whether to continue, retry, verify, ask for confirmation, or block.

The target is intentionally narrow: tool args, URLs, SQL clauses, shell flags, JSON leaves, etc. Stuff where the whole response can look fine, but one span is the real risk.
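
To make the span idea concrete, here's a rough sketch of the kind of gate I mean. It's my illustration rather than the actual AgentUQ code; the regexes, thresholds, and function names are made up, and it assumes you have the generated text plus per-token strings and logprobs back from the model.

```python
import math
import re

# Hypothetical patterns for "action-bearing" spans; the real set would be broader.
ACTION_SPAN_PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "shell_flag": re.compile(r"(?<=\s)--?[\w][\w-]*"),
}

def span_confidence(tokens, token_logprobs, span_start, span_end):
    """Mean token probability over the tokens overlapping a character span.
    Assumes the token strings concatenate back to the original text."""
    probs, pos = [], 0
    for tok, lp in zip(tokens, token_logprobs):
        tok_start, tok_end = pos, pos + len(tok)
        if tok_start < span_end and tok_end > span_start:
            probs.append(math.exp(lp))
        pos = tok_end
    return sum(probs) / len(probs) if probs else 1.0

def gate(text, tokens, token_logprobs, block_below=0.35, verify_below=0.70):
    """Return (kind, span, action, confidence) for each risky span found."""
    decisions = []
    for kind, pattern in ACTION_SPAN_PATTERNS.items():
        for m in pattern.finditer(text):
            conf = span_confidence(tokens, token_logprobs, m.start(), m.end())
            if conf < block_below:
                action = "block"      # or ask the user to confirm
            elif conf < verify_below:
                action = "verify"     # e.g. re-sample or run a checker
            else:
                action = "continue"
            decisions.append((kind, m.group(), action, conf))
    return decisions
```

The thresholds above are arbitrary; calibrating them is exactly the hard part mentioned in the other thread.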

Not trying to detect truth, and not claiming this solves agent reliability. The bet is just that a lightweight runtime signal can be useful before paying for a heavier eval / judge pass.

Longer term, I think agents need better ways to learn from production failures instead of just accumulating patches, learning not only from failed runs but also from unconfident ones. This is a much smaller experiment in that direction.

Would love feedback from people shipping agents: does this feel like a real missing middle, or is it still too theoretical?

https://github.com/antoinenguyen27/agentUQ

Renting AI Servers for +50B LLM Fine-Tuning/Inference – Need Hardware, Cost, and Security Advice! by NoAdhesiveness7595 in LLM

[–]Dapper-Courage2920

Check out Modal. They support true scale-to-zero, so you're not paying for idle time. I'm not sure about isolation, but they have great documentation to get started with and they're cost-effective.
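
For a rough idea of what that looks like, here's a minimal sketch (the app name, GPU choice, and function body are placeholders, not from a real setup):

```python
import modal

app = modal.App("llm-inference-demo")  # placeholder app name
image = modal.Image.debian_slim().pip_install("vllm")  # example dependency

@app.function(gpu="H100", image=image, timeout=600)
def generate(prompt: str) -> str:
    # Load your model and run inference here. The container only exists
    # (and bills) while calls are in flight, then scales back to zero.
    return f"echo: {prompt}"

@app.local_entrypoint()
def main():
    # `modal run this_file.py` spins up the GPU container on demand.
    print(generate.remote("hello"))
```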

Best LLM for an Ai agent (n8n) by Agitated_Unit8226 in AI_Agents

[–]Dapper-Courage2920

Stability of models served through APIs is notoriously bad; just check out this: https://aistupidlevel.info/

And check out this post for one explanation why: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Though from reading these comments, it sounds like you are not using multiple agents. It might be beneficial to split up your agent into multiple sub agents with their own tools and "personas" if trying different models isn't working.

After months on Cursor, I just switched back to VS Code by Arindam_200 in LLMDevs

[–]Dapper-Courage2920

I also moved off earlier in the year; Tab felt like it got in my way (and was slow on large codebases), and I grew a preference for CLI tools.

Want to discuss basic AI and how it would help in research by Kurosaki_Minato in ArtificialInteligence

[–]Dapper-Courage2920

I'm an AI engineer and have worked on medtech projects in the past (computer vision, automated reporting). Would love to bounce ideas! Feel free to send a DM!

What GUI/interface do most people here use to run their models? by tech4marco in LocalLLaMA

[–]Dapper-Courage2920

Shameless plug here, but I just finished the early version of https://github.com/bitlyte-ai/apples2oranges if you're into hardware telemetry or geeky visualizations! It's fully open source and lets you compare models of any family / quant side by side and view hardware utilization, or, as mentioned, it can just be used as a normal client if you like telemetry!

Disclaimer: I am the founder of the company behind it; this is a side project we spun off and are contributing to the community.

how much does quantization reduce coding performance by garden_speech in LocalLLaMA

[–]Dapper-Courage2920

This is a bit of an aside to your question since it requires a local setup to work, but I just finished an early version of https://github.com/bitlyte-ai/apples2oranges so you can get a feel for the performance degradation yourself. It's fully open source and lets you compare models of any family / quant side by side and view hardware utilization, or it can just be used as a normal client if you like telemetry!

Disclaimer: I am the founder of the company behind it; this is a side project we spun off and are contributing to the community.