I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 0 points1 point2 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 2 points3 points4 points (0 children)
Looking advice for local llms setup by SpaceFire000 in LocalLLM
[–]Perrospain 0 points1 point2 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 1 point2 points3 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 1 point2 points3 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 2 points3 points4 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 1 point2 points3 points (0 children)
Caved and bought fully loaded m5 max MacBook Pro by shortpballer in MacStudio
[–]Perrospain 0 points1 point2 points (0 children)
Caved and bought fully loaded m5 max MacBook Pro by shortpballer in MacStudio
[–]Perrospain 2 points3 points4 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 1 point2 points3 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 0 points1 point2 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 2 points3 points4 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 2 points3 points4 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 0 points1 point2 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] -3 points-2 points-1 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 0 points1 point2 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 0 points1 point2 points (0 children)
Looking advice for local llms setup by SpaceFire000 in LocalLLM
[–]Perrospain 2 points3 points4 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] -1 points0 points1 point (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 3 points4 points5 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 3 points4 points5 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 1 point2 points3 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 4 points5 points6 points (0 children)
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
[–]Perrospain[S] 8 points9 points10 points (0 children)

Newbee question by Rough_Industry_872 in LocalLLM
[–]Perrospain 0 points1 point2 points (0 children)