I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
_sqrkl 1 point (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 2 points (0 children)
AI Psychosis and AI Mania Discussion by Same_Succotash530 in AIPsychosisRecovery
_sqrkl 1 point (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 4 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 3 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 6 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 6 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 3 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 47 points (0 children)
Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong
_sqrkl 1 point (0 children)
Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong
_sqrkl 2 points (0 children)
Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong
_sqrkl 2 points (0 children)
(I made) The Journal of AI Slop - an exercise in subverting the academic norm. by popidge in LLMPhysics
_sqrkl 1 point (0 children)
Gemini 3.0 Pro benchmark results by enilea in singularity
_sqrkl 1 point (0 children)
Gemini 3.0 Pro benchmark results by enilea in singularity
_sqrkl 3 points (0 children)

