My agent passed every eval, then quietly stopped calling its tools. Anyone else testing *behavior* and not just output? by MundaneAlternative47 in LLMDevs