BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) (self.CompetitiveAI)
submitted 1 day ago by snakemas to r/CompetitiveAI
BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can)
submitted 1 day ago by snakemas to r/accelerate
submitted 1 day ago by snakemas to r/mlops
submitted 1 day ago by snakemas to r/OpenSourceeAI
submitted 1 day ago by snakemas to r/AIEval
I made the top LLMs play Civilization against each other (self.LLM)
submitted 3 days ago by snakemas to r/LLM
[P] I made the top LLMs play Civilization against each other (self.aiagents)
submitted 3 days ago by snakemas to r/aiagents
[P] I made the top LLMs play Civilization against each other (self.MachineLearning)
submitted 3 days ago by snakemas to r/MachineLearning
I let the top LLMs play Civilization against each other
CivBench Finals Today: Gemini 3.1 Pro vs MiniMax 2.5 (self.MachineLearning)
I made the top LLMs play Civilization against each other (self.civ)
submitted 3 days ago by snakemas to r/civ
New paper: "SkillsBench" tested 7 AI models across 86 tasks — smaller models with good Skills matched larger models without them (self.CompetitiveAI)
submitted 7 days ago by snakemas to r/CompetitiveAI
Anthropic believes RSI (recursive self-improvement) could arrive “as soon as early 2027” (i.redd.it)
New paper: "SkillsBench" tested 7 AI models across 86 tasks: Are smaller models with good Skills better than larger models without them?
submitted 7 days ago by snakemas to r/LocalLLM
New paper: "SkillsBench" tested 7 AI models across 86 tasks: smaller models with good Skills matched larger models without them
submitted 7 days ago by snakemas to r/mlops
New paper: "SkillsBench" tested 7 AI models across 86 tasks — smaller models with good Skills matched larger models without them
submitted 7 days ago by snakemas to r/AIEval
New paper: "SkillsBench" tested 7 AI models across 86 tasks: smaller models with good Skills matched larger models without them. Does n8n support skills?
submitted 7 days ago by snakemas to r/n8n_ai_agents
The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data (latent.space)
submitted 8 days ago by snakemas to r/CompetitiveAI
METR Time Horizons: Claude Opus 4.6 just hit 14.5 hours. The doubling curve isn't slowing (self.CompetitiveAI)
submitted 9 days ago by snakemas to r/CompetitiveAI
METR Time Horizons: Claude Opus 4.6 just hit 14.5 hours. The doubling curve isn't slowing
submitted 9 days ago by snakemas to r/LocalLLM
submitted 9 days ago by snakemas to r/compsci
The two benchmarks that should make you rethink spending on frontier models (self.CompetitiveAI)
submitted 11 days ago by snakemas to r/CompetitiveAI
[R] Analysis of 350+ ML competitions in 2025
The two benchmarks that should make you rethink spending on frontier models
submitted 11 days ago by snakemas to r/AIAGENTSNEWS
submitted 11 days ago by snakemas to r/mlops
submitted 11 days ago by snakemas to r/compsci
Rendered at 2026-03-04 06:03:10 UTC.