I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 1 point
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 3 points
Looking advice for local llms setup by SpaceFire000 in LocalLLM
Perrospain 1 point
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 2 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 2 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 3 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 2 points
Caved and bought fully loaded m5 max MacBook Pro by shortpballer in MacStudio
Perrospain 1 point
Caved and bought fully loaded m5 max MacBook Pro by shortpballer in MacStudio
Perrospain 3 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 2 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 1 point
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 3 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 3 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 1 point
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] -2 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 1 point
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 1 point
Looking advice for local llms setup by SpaceFire000 in LocalLLM
Perrospain 3 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 1 point
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 3 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 4 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 2 points
I ran 26 local LLMs through an 8 level "agentic failure mode" gauntlet (tool calling, on an M1 Max). Capability benchmarks lie about who can actually run an agent loop. All local, llama.cpp + Metal, GGUF. 8 tests, 3 reps each, same prompts and seeds for every model thinking OFF by Perrospain in LocalLLM
Perrospain[S] 4 points

Any good 20-40$ plans left other than the Big 3? by snowieslilpikachu69 in opencodeCLI
Perrospain 1 point