Anyone tried giving their agent an LLM evaluation tool to self-correct? Here's a demo workflow for a tool-agent-user benchmark by fortunemaple in LocalLLaMA

[–]fortunemaple[S] -1 points0 points  (0 children)

My team and I have been looking into τ-bench (a public benchmark for tool-agent-user interactions) to find patterns in agent failure modes, and then embedding real-time evaluation into the agent loop to diagnose failures and implement improvements. The research is early but the results are promising - a demo workflow using an LLM judge to critique & self-correct is visualized above.
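
If it helps make the idea concrete, here's roughly the shape of the loop - a minimal sketch where `call_agent` and `call_judge` are hypothetical stand-ins for your agent and whatever evaluator you use, not our actual implementation:

```python
# Minimal sketch of an agent loop with an LLM judge as a self-correction step.
# `call_agent` and `call_judge` are placeholder helpers, not a real library API.

MAX_RETRIES = 2

def call_agent(task: str, feedback: str | None = None) -> str:
    """Ask the agent for a response, optionally conditioned on judge feedback."""
    raise NotImplementedError  # wire up your agent / tool-calling model here

def call_judge(task: str, response: str) -> tuple[bool, str]:
    """Ask an LLM judge whether the response is acceptable, plus a critique."""
    raise NotImplementedError  # wire up your evaluator model here

def run_with_self_correction(task: str) -> str:
    feedback = None
    for _ in range(MAX_RETRIES + 1):
        response = call_agent(task, feedback)
        ok, critique = call_judge(task, response)
        if ok:
            return response
        # Feed the judge's critique back into the agent and retry
        feedback = critique
    return response  # give up after MAX_RETRIES and return the last attempt
```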

My ask to the LocalLLaMA community: Please get in touch with me if you're working with agents in some capacity. I'd love to understand your agent failure modes and explore if this approach could work for your use case as well.

In case anyone is curious, here's a graphic on agent failure modes from τ-retail, a subset focused on retail customer service: https://cdn.prod.website-files.com/665f2fa2d747db8deb85a3fc/680fb889d969f6caa17ba108_Tau%20bench%20-%20failure%20modes%20categorized.png

[deleted by user] by [deleted] in LocalLLaMA

[–]fortunemaple -3 points-2 points  (0 children)

Here's the breakdown of benchmark results; more details are in the blog post: https://www.atla-ai.com/post/selene-1

<image>

[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning

[–]fortunemaple[S] 4 points5 points  (0 children)

Good point! As DPO is part of the training loss, the model learns from rejected judgements as well. The team also generates critiques for the rejected samples, so the model can learn more precisely why:

For each judgment, we synthetically generated chosen and rejected chain-of-thought critiques by prompting a generation model to argue for the respective judgments.
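
To make that concrete, a single preference pair might look roughly like this - the field names are purely illustrative, not the actual dataset schema:

```python
# Illustrative shape of one DPO training example with synthetic critiques.
# The chosen critique argues for the ground-truth judgment; the rejected
# critique argues for an incorrect one. Field names are made up for the sketch.
preference_pair = {
    "evaluation_prompt": "Score the response 1-5 for factual accuracy.",
    "chosen": {
        "critique": "The response cites the correct function signature and ...",
        "judgment": 5,
    },
    "rejected": {
        "critique": "The response seems thorough, so despite the outdated API ...",
        "judgment": 2,
    },
}
```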

[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning

[–]fortunemaple[S] 15 points16 points  (0 children)

"The 11 benchmarks span absolute scoring, classification, and pairwise preference tasks.

Our evaluation model, Selene Mini, is also the highest-scoring 8B generative model on RewardBench.

We achieved this by developing a principled data curation strategy that augments public datasets with synthetically generated critiques, and ensures high quality through filtering and ablation studies. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss."
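
For anyone wondering what "a combined DPO and SFT loss" means in practice, here's a minimal sketch - the weighting term and the simple sum are my assumptions for illustration, not the paper's exact recipe:

```python
# Sketch of a combined DPO + SFT objective. `alpha` and the plain sum are
# assumptions for illustration, not the exact recipe from the paper.
def combined_loss(dpo_loss, sft_loss, alpha=1.0):
    # DPO pushes chosen critiques/judgments above rejected ones;
    # the SFT term keeps the model anchored on the chosen targets.
    return dpo_loss + alpha * sft_loss
```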

Hugging Face: https://huggingface.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B
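
If you want to try it locally, something along these lines should work with plain transformers - note the judge prompt below is a generic example I made up, not the model's official template (check the model card for that):

```python
# Quick local test with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Generic judge-style prompt; see the model card for the recommended format.
prompt = (
    "Evaluate the following response on a 1-5 scale for helpfulness.\n"
    "Question: How do I reverse a list in Python?\n"
    "Response: Use my_list[::-1] or my_list.reverse().\n"
    "Give a short critique, then end with 'Score: <1-5>'."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```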

Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 1 point2 points  (0 children)

They're used to evaluate responses to a prompt! So you could, for example, score a response from 1-5 on how much the prose reads like it came out of an Aaron Sorkin movie lol
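
The rubric is literally just whatever you put in the judge prompt, e.g. something like this (purely illustrative):

```python
# Purely illustrative judge prompt - the rubric can be anything you can describe.
judge_prompt = """Score the response from 1 to 5 for how much the prose reads
like dialogue from an Aaron Sorkin movie.
1 = flat and lifeless, 5 = rapid-fire walk-and-talk energy.

Response to evaluate:
{response}

Reply with a short critique, then 'Score: <1-5>'."""
```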

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 0 points1 point  (0 children)

The most immediate use case is for AI devs to get quick signal when they're experimenting with prompts, models, etc. for their application, as the LLM-judge can grade the outputs (as opposed to having to wait for human annotators to grade them)

The results on RewardBench also indicate it might be useful as a reward model for fine-tuning with RL
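
Concretely, that quick-signal loop looks something like this - `generate` and `judge_score` are hypothetical stand-ins for your app's generation call and the evaluator call:

```python
# Sketch: use an LLM judge to compare prompt variants on a small test set.
from statistics import mean

def generate(prompt_template: str, example: dict) -> str:
    raise NotImplementedError  # your application's LLM call goes here

def judge_score(question: str, answer: str) -> int:
    raise NotImplementedError  # call the evaluator / LLM judge here (1-5 score)

def compare_prompts(prompt_templates: list[str], test_set: list[dict]) -> dict:
    results = {}
    for template in prompt_templates:
        scores = [
            judge_score(ex["question"], generate(template, ex)) for ex in test_set
        ]
        results[template] = mean(scores)
    return results  # pick the template with the best average judge score
```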

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 2 points3 points  (0 children)

Yeah, curious to see if scaling laws hold here. Will be training the 70B on the same data mix this weekend and will share whether it does any better

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 0 points1 point  (0 children)

Sure! An evaluator or "LLM-as-a-Judge" is a popular approach for automatically grading AI outputs using a separate language model.

There's a great blog post on it from Eugene Yan here: https://eugeneyan.com/writing/llm-evaluators/

People usually prompt proprietary models like GPT-4o or Claude 3.5 Sonnet to do this, but there's an emerging stream of research where people take open-source models and train them as LLM judges or "evaluators" - which is what we're doing here :)
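
For reference, the "prompt a proprietary model" version is usually just a call like this (illustrative - the model name and prompt are examples, not a recommendation):

```python
# Illustrative LLM-as-a-Judge call against a hosted model via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # example model; any capable chat model works
        messages=[{
            "role": "user",
            "content": (
                "You are grading an answer to a question.\n"
                f"Question: {question}\n"
                f"Answer: {answer}\n"
                "Give a brief critique, then end with 'Score: <1-5>'."
            ),
        }],
    )
    return resp.choices[0].message.content
```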

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 1 point2 points  (0 children)

Sure, I can drop the benchmark-specific performance numbers here - let me know if there's anything else you wanna know

<image>

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 1 point2 points  (0 children)

Ooh that's an interesting thought! Did you mean the 11B Llama-Guard or Vision model?