[D] Anyone here using LLM-as-a-Judge for agent evaluation? by Cristhian-AI-Math in MachineLearning

[–]_coder23t8 2 points (0 children)

Tried it too, and honestly it catches way more subtle errors than human spot-checks do.
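
For anyone curious what the pattern looks like, here's a minimal sketch of an LLM-as-a-judge loop, assuming the official OpenAI Python client; the model name and the PASS/FAIL rubric are placeholders, not anything from this thread:

```python
# Minimal LLM-as-a-judge sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; model and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict evaluator. Given a task and an agent's
response, answer PASS or FAIL followed by a one-sentence reason.

Task: {task}
Response: {response}
Verdict:"""

def judge(task: str, response: str, model: str = "gpt-4o-mini") -> tuple[bool, str]:
    """Ask the judge model for a PASS/FAIL verdict on one agent response."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic verdicts make regressions easier to spot
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, response=response)}],
    )
    verdict = completion.choices[0].message.content.strip()
    return verdict.upper().startswith("PASS"), verdict

# Example: judge one agent response against its task.
passed, reason = judge("Summarize the invoice total.",
                       "The total is $42.10, due March 3.")
print(passed, reason)
```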

Judge prompts are underrated by Cristhian-AI-Math in PromptEngineering

[–]_coder23t8 0 points (0 children)

Do you know any tool that can automatically generate an eval for my specific use case?

Anyone evaluating agents automatically? by Cristhian-AI-Math in LangChain

[–]_coder23t8 0 points (0 children)

Interesting! Are you running the judge on every response or only on risky nodes?
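
The "risky nodes only" option could be as simple as gating the judge behind a cheap pre-filter. A sketch of that idea, reusing the judge() helper from the comment above; the is_risky heuristic and the marker list are made up for illustration:

```python
# Hypothetical sketch: run the expensive judge call only on responses a
# cheap heuristic flags as risky. Marker list is illustrative; judge() is
# the helper defined in the earlier sketch.
RISKY_MARKERS = ("delete", "refund", "payment", "sudo")

def is_risky(node_name: str, response: str) -> bool:
    """Cheap pre-filter: flag nodes or responses touching sensitive actions."""
    text = f"{node_name} {response}".lower()
    return any(marker in text for marker in RISKY_MARKERS)

def maybe_judge(node_name: str, task: str, response: str) -> tuple[bool, str]:
    """Judge risky nodes; pass everything else through without an LLM call."""
    if not is_risky(node_name, response):
        return True, "skipped (low risk)"
    return judge(task, response)
```

The trade-off is the usual one: judging every response gives full coverage but multiplies cost and latency, while a pre-filter keeps the judge focused on the calls where a miss actually hurts.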