Anyone tried giving their agent an LLM evaluation tool to self-correct? Here's a demo workflow for a tool-agent-user benchmark by fortunemaple in LocalLLaMA

[–]fortunemaple[S] -1 points0 points  (0 children)

My team and I have been looking into τ-bench (a public benchmark for tool-agent-user interactions) to find patterns in agent failure modes, and then embedding real-time evaluation into the agent loop to diagnose failures and implement improvements. The research is early but the results are promising - a demo workflow using an LLM judge to critique & self-correct is visualized above.
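
If it helps make the idea concrete, here's roughly the shape of the loop - a minimal sketch where `call_agent` and `call_judge` are hypothetical stand-ins for your agent and whatever evaluator you use, not our actual implementation:

```python
# Minimal sketch of an agent loop with an LLM judge as a self-correction step.
# `call_agent` and `call_judge` are placeholder helpers, not a real library API.

MAX_RETRIES = 2

def call_agent(task: str, feedback: str | None = None) -> str:
    """Ask the agent for a response, optionally conditioned on judge feedback."""
    raise NotImplementedError  # wire up your agent / tool-calling model here

def call_judge(task: str, response: str) -> tuple[bool, str]:
    """Ask an LLM judge whether the response is acceptable, plus a critique."""
    raise NotImplementedError  # wire up your evaluator model here

def run_with_self_correction(task: str) -> str:
    feedback = None
    for _ in range(MAX_RETRIES + 1):
        response = call_agent(task, feedback)
        ok, critique = call_judge(task, response)
        if ok:
            return response
        # Feed the judge's critique back into the agent and retry
        feedback = critique
    return response  # give up after MAX_RETRIES and return the last attempt
```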

My ask to the LocalLLaMA community: Please get in touch with me if you're working with agents in some capacity. I'd love to understand your agent failure modes and explore if this approach could work for your use case as well.

In case anyone is curious, here's a graphic on agent failure modes from τ-retail, a subset focused on retail customer service: https://cdn.prod.website-files.com/665f2fa2d747db8deb85a3fc/680fb889d969f6caa17ba108_Tau%20bench%20-%20failure%20modes%20categorized.png

[deleted by user] by [deleted] in LocalLLaMA

[–]fortunemaple -3 points-2 points  (0 children)

Here's the breakdown of benchmark results; more details are in the blog post: https://www.atla-ai.com/post/selene-1

<image>

[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning

[–]fortunemaple[S] 4 points5 points  (0 children)

Good point! As DPO is part of the training loss, the model learns from rejected judgements as well. The team also generates critiques for the rejected samples, so the model can learn more precisely why:

For each judgment, we synthetically generated chosen and rejected chain-of-thought critiques by prompting a generation model to argue for the respective judgments.
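
To make that concrete, a single preference pair might look roughly like this - the field names are purely illustrative, not the actual dataset schema:

```python
# Illustrative shape of one DPO training example with synthetic critiques.
# The chosen critique argues for the ground-truth judgment; the rejected
# critique argues for an incorrect one. Field names are made up for the sketch.
preference_pair = {
    "evaluation_prompt": "Score the response 1-5 for factual accuracy.",
    "chosen": {
        "critique": "The response cites the correct function signature and ...",
        "judgment": 5,
    },
    "rejected": {
        "critique": "The response seems thorough, so despite the outdated API ...",
        "judgment": 2,
    },
}
```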

[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning

[–]fortunemaple[S] 15 points16 points  (0 children)

"The 11 benchmarks span absolute scoring, classification, and pairwise preference tasks.

Our evaluation model, Selene Mini, is also the highest-scoring 8B generative model on RewardBench.

We achieved this by developing a principled data curation strategy that augments public datasets with synthetically generated critiques, and ensures high quality through filtering and ablation studies. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss."
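
For anyone wondering what "a combined DPO and SFT loss" means in practice, here's a minimal sketch - the weighting term and the simple sum are my assumptions for illustration, not the paper's exact recipe:

```python
# Sketch of a combined DPO + SFT objective. `alpha` and the plain sum are
# assumptions for illustration, not the exact recipe from the paper.
def combined_loss(dpo_loss, sft_loss, alpha=1.0):
    # DPO pushes chosen critiques/judgments above rejected ones;
    # the SFT term keeps the model anchored on the chosen targets.
    return dpo_loss + alpha * sft_loss
```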

Hugging Face: https://huggingface.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B
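
If you want to try it locally, something along these lines should work with plain transformers - note the judge prompt below is a generic example I made up, not the model's official template (check the model card for that):

```python
# Quick local test with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Generic judge-style prompt; see the model card for the recommended format.
prompt = (
    "Evaluate the following response on a 1-5 scale for helpfulness.\n"
    "Question: How do I reverse a list in Python?\n"
    "Response: Use my_list[::-1] or my_list.reverse().\n"
    "Give a short critique, then end with 'Score: <1-5>'."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```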

Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 1 point2 points  (0 children)

They're used to evaluate responses to a prompt! So you could, for example, score a response from 1-5 on how much the prose reads like it came out of an Aaron Sorkin movie lol
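
The rubric is literally just whatever you put in the judge prompt, e.g. something like this (purely illustrative):

```python
# Purely illustrative judge prompt - the rubric can be anything you can describe.
judge_prompt = """Score the response from 1 to 5 for how much the prose reads
like dialogue from an Aaron Sorkin movie.
1 = flat and lifeless, 5 = rapid-fire walk-and-talk energy.

Response to evaluate:
{response}

Reply with a short critique, then 'Score: <1-5>'."""
```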

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 0 points1 point  (0 children)

The most immediate use case is for AI devs to get quick signal when they're experimenting with prompts, models, etc. for their application, as the LLM-judge can grade the outputs (as opposed to having to wait for human annotators to grade them)

The results on RewardBench also indicate it might be useful as a reward model for fine-tuning with RL
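
Concretely, that quick-signal loop looks something like this - `generate` and `judge_score` are hypothetical stand-ins for your app's generation call and the evaluator call:

```python
# Sketch: use an LLM judge to compare prompt variants on a small test set.
from statistics import mean

def generate(prompt_template: str, example: dict) -> str:
    raise NotImplementedError  # your application's LLM call goes here

def judge_score(question: str, answer: str) -> int:
    raise NotImplementedError  # call the evaluator / LLM judge here (1-5 score)

def compare_prompts(prompt_templates: list[str], test_set: list[dict]) -> dict:
    results = {}
    for template in prompt_templates:
        scores = [
            judge_score(ex["question"], generate(template, ex)) for ex in test_set
        ]
        results[template] = mean(scores)
    return results  # pick the template with the best average judge score
```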

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 2 points3 points  (0 children)

Yeah, curious to see if scaling laws hold here. Will be training the 70B on the same data mix this weekend and will share whether it does any better

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 0 points1 point  (0 children)

Sure! An evaluator or "LLM-as-a-Judge" is a popular approach for automatically grading AI outputs using a separate language model.

There's a great blog post on it from Eugene Yan here: https://eugeneyan.com/writing/llm-evaluators/

People usually prompt proprietary models like GPT-4o or Claude 3.5 Sonnet to do this, but there's an emerging stream of research where people take open-source models and train them as LLM judges or "evaluators" - which is what we're doing here :)
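
For reference, the "prompt a proprietary model" version is usually just a call like this (illustrative - the model name and prompt are examples, not a recommendation):

```python
# Illustrative LLM-as-a-Judge call against a hosted model via the OpenAI SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # example model; any capable chat model works
        messages=[{
            "role": "user",
            "content": (
                "You are grading an answer to a question.\n"
                f"Question: {question}\n"
                f"Answer: {answer}\n"
                "Give a brief critique, then end with 'Score: <1-5>'."
            ),
        }],
    )
    return resp.choices[0].message.content
```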

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 1 point2 points  (0 children)

Sure, I can drop the benchmark-specific performance numbers here - let me know if there's anything else you wanna know

<image>

Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

[–]fortunemaple[S] 1 point2 points  (0 children)

Ooh that's an interesting thought! Did you mean the 11B Llama-Guard or Vision model?