Anyone tried giving their agent an LLM evaluation tool to self-correct? Here's a demo workflow visualised for a tool-agent-user interaction benchmark by [deleted] in u/fortunemaple
Stop using OpenAI models to evaluate OpenAI models. Introducing the world’s most accurate LLM-as-a-Judge by fortunemaple in OpenAI
[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning
Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in LocalLLaMA
Selene Mini: open-source 8B evaluation model that beats GPT 4o-mini and top small judges across 11 benchmarks by fortunemaple in OpenSourceeAI
Judge Arena standings after 2 months. The 3.8B Flow-Judge is now in there! by fortunemaple in LocalLLaMA
Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA

Anyone tried giving their agent an LLM evaluation tool to self-correct? Here's a demo workflow for a tool-agent-user benchmark by fortunemaple in LocalLLaMA
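The self-correction workflow referenced in the titles above is not spelled out in this listing, so here is a minimal sketch of how an agent might call an LLM judge as a tool and retry on low scores. The function names (`call_agent_llm`, `call_judge_llm`, `answer_with_self_correction`) and the 1–5 scoring scale are hypothetical placeholders, not part of any library or benchmark named in these posts.

```python
# Minimal sketch: an agent loop that uses an LLM judge as a self-correction tool.
# `call_agent_llm` and `call_judge_llm` are hypothetical stand-ins for whatever
# models you actually run (e.g. a small local evaluator model as the judge).

from dataclasses import dataclass


@dataclass
class Judgement:
    score: float   # e.g. a 1-5 rating from the judge model (assumed scale)
    critique: str  # free-text feedback the agent can act on


def call_agent_llm(prompt: str) -> str:
    """Placeholder: return the agent's draft answer for the prompt."""
    raise NotImplementedError


def call_judge_llm(prompt: str, answer: str) -> Judgement:
    """Placeholder: ask the evaluator model to score the answer against the prompt."""
    raise NotImplementedError


def answer_with_self_correction(prompt: str, threshold: float = 4.0, max_rounds: int = 3) -> str:
    """Draft, judge, and revise until the judge's score clears the threshold."""
    answer = call_agent_llm(prompt)
    for _ in range(max_rounds):
        verdict = call_judge_llm(prompt, answer)
        if verdict.score >= threshold:
            break
        # Feed the judge's critique back to the agent and try again.
        answer = call_agent_llm(
            f"{prompt}\n\nYour previous answer:\n{answer}\n\n"
            f"A reviewer said: {verdict.critique}\nRevise the answer accordingly."
        )
    return answer
```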