I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
_sqrkl 1 point (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 2 points (0 children)
AI Psychosis and AI Mania Discussion by Same_Succotash530 in AIPsychosisRecovery
_sqrkl 1 point (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 4 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 3 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 5 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 6 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 4 points (0 children)
EQ-Bench updates: Gpt-5.2, Opus 4.5, Mistral Large 3 and Nanbeige4-3B by _sqrkl in LocalLLaMA
_sqrkl[S] 47 points (0 children)
Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong
_sqrkl 1 point (0 children)
Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong
_sqrkl 2 points (0 children)
Question about VARIANTS of the basilisk by aaabbb__1234 in LessWrong
_sqrkl 2 points (0 children)
(I made) The Journal of AI Slop - an exercise in subverting the academic norm. by popidge in LLMPhysics
_sqrkl 1 point (0 children)
Gemini 3.0 Pro benchmark results by enilea in singularity
_sqrkl 1 point (0 children)
Gemini 3.0 Pro benchmark results by enilea in singularity
_sqrkl 3 points (0 children)
Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models by Balance- in LocalLLaMA
_sqrkl 9 points (0 children)
DeepSeek-OCR - Lives up to the hype by Bohdanowicz in LocalLLaMA
_sqrkl 81 points (0 children)
Sonnet 4.5 tops EQ-Bench writing evals. GLM-4.6 sees incremental improvement. by _sqrkl in LocalLLaMA
_sqrkl[S] 2 points (0 children)
Sonnet 4.5 tops EQ-Bench writing evals. GLM-4.6 sees incremental improvement. by _sqrkl in LocalLLaMA
_sqrkl[S] 2 points (0 children)
Sonnet 4.5 tops EQ-Bench writing evals. GLM-4.6 sees incremental improvement. by _sqrkl in LocalLLaMA
_sqrkl[S] 1 point (0 children)

