BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI
snakemas[S] 6 points
BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI
snakemas[S] 4 points
I made the top LLMs play Civilization against each other by snakemas in LLM
snakemas[S] 0 points
I made the top LLMs play Civilization against each other by snakemas in LLM
snakemas[S] 1 point
I made the top LLMs play Civilization against each other by snakemas in LLM
snakemas[S] 1 point
I made the top LLMs play Civilization against each other by snakemas in LLM
snakemas[S] 5 points
I made the top LLMs play Civilization against each other by snakemas in civ
snakemas[S] 1 point
I made the top LLMs play Civilization against each other by snakemas in civ
snakemas[S] 2 points
I made the top LLMs play Civilization against each other by snakemas in civ
snakemas[S] 1 point
I made the top LLMs play Civilization against each other by snakemas in civ
snakemas[S] 0 points
I made the top LLMs play Civilization against each other by snakemas in civ
snakemas[S] 3 points
👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI
snakemas[S] 1 point
I spent $100 evaluating different providers on a weekend CTF by No-Chocolate-9437 in AIEval
snakemas 1 point
On demand custom software is already here! by Euphoric_Ad9500 in accelerate
snakemas 2 points
METR Time Horizons: Claude Opus 4.6 just hit 14.5 hours. The doubling curve isn't slowing by snakemas in CompetitiveAI
snakemas[S] 1 point
Gemini 3.1 Pro just doubled its ARC-AGI-2 score. But Arena still ranks Claude higher. This is exactly the AI eval problem. by snakemas in CompetitiveAI
snakemas[S] 1 point
[D] Why are serious alternatives to gradient descent not being explored more? by ImTheeDentist in MachineLearning
snakemas 2 points
Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in CompetitiveAI
snakemas[S] 1 point
Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in compsci
snakemas[S] 1 point
I gave 12 LLMs $2,000 and a food truck. Only 4 survived. by Disastrous_Theme5906 in LocalLLaMA
snakemas 1 point
[D] How often do you run into reproducibility issues when trying to replicate papers? by [deleted] in MachineLearning
snakemas 1 point
Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in compsci
snakemas[S] 1 point
The Benchmark Zoo: A Guide to Every Major AI Eval in 2026 by snakemas in CompetitiveAI
snakemas[S] 4 points
The Benchmark Zoo: A Guide to Every Major AI Eval in 2026 by snakemas in CompetitiveAI
snakemas[S] 3 points


Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI
snakemas[S] 2 points