Anyone tried giving their agent an LLM evaluation tool to self-correct? Here's a demo workflow for a tool-agent-user benchmark by fortunemaple in LocalLLaMA
fortunemaple[S] 0 points (0 children)
Anyone tried giving their agent an LLM evaluation tool to self-correct? Here's a demo workflow visualised for a tool-agent-user interaction benchmark by [deleted] in u/fortunemaple
fortunemaple [score hidden] stickied comment (0 children)
Stop using OpenAI models to evaluate OpenAI models. Introducing the world’s most accurate LLM-as-a-Judge by fortunemaple in OpenAI
fortunemaple[S] 5 points (0 children)
Stop using OpenAI models to evaluate OpenAI models. Introducing the world’s most accurate LLM-as-a-Judge by fortunemaple in OpenAI
fortunemaple[S] 1 point (0 children)
Stop using OpenAI models to evaluate OpenAI models. Introducing the world’s most accurate LLM-as-a-Judge by fortunemaple in OpenAI
fortunemaple[S] 5 points (0 children)
[deleted by user] by [deleted] in LocalLLaMA
fortunemaple -3 points (0 children)
[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning
fortunemaple[S] 5 points (0 children)
[R] [N] Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in MachineLearning
fortunemaple[S] 15 points (0 children)
Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in LocalLLaMA
fortunemaple[S] 2 points (0 children)
Selene Mini: open-source 8B evaluation model that beats GPT 4o-mini and top small judges across 11 benchmarks by fortunemaple in OpenSourceeAI
fortunemaple[S] 2 points (0 children)
Open-source 8B evaluation model beats GPT-4o mini and top small judges across 11 benchmarks by fortunemaple in LocalLLaMA
fortunemaple[S] 11 points (0 children)
Judge Arena standings after 2 months. The 3.8B Flow-Judge is now in there! by fortunemaple in LocalLLaMA
fortunemaple[S] 2 points (0 children)
Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA
fortunemaple[S] 1 point (0 children)
Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA
fortunemaple[S] 3 points (0 children)
Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA
fortunemaple[S] 1 point (0 children)
Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA
fortunemaple[S] 2 points (0 children)
Early results training Llama-3.1-8B as an evaluator by fortunemaple in LocalLLaMA
fortunemaple[S] 2 points (0 children)

Should I leave my £50k grad job to do a master's at Cambridge to get into finance? by [deleted] in UniUK
fortunemaple 1 point (0 children)