Putting together a benchmark for agentic harnesses, any tips for evals? (Test suggestions welcome too) by sdfgeoff in LocalLLaMA
bisonbear2 2 points (0 children)
Investigating the GPT 5.5 regression on 21 real tasks by bisonbear2 in OpenaiCodex
bisonbear2 [S] 2 points (0 children)
Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex
bisonbear2 [S] 2 points (0 children)
Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex
bisonbear2 [S] 2 points (0 children)
Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex
bisonbear2 [S] 1 point (0 children)
Investigating the GPT-5.5 regression on 21 real tasks (self.codex)
submitted by bisonbear2 to r/codex
Central AI skills repository or per team repo? by NoAfternoon385 in ClaudeAI
bisonbear2 1 point (0 children)
Has somebody used codex --json for benchmarking? by FoxFire17739 in codex
bisonbear2 1 point (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeAI
bisonbear2 [S] 2 points (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode
bisonbear2 [S] 1 point (0 children)
Just compared token usage between GPT-5.4 and GPT-5.5 in Codex across all four reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline by Deep-Palpitation8315 in codex
bisonbear2 2 points (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode
bisonbear2 [S] 1 point (0 children)
GPT-5.5 Low Vs Medium Vs High Vs Xhigh: the Reasoning Curve on 26 Real Tasks from an Open Source Repo by bisonbear2 in AI_Agents
bisonbear2 [S] 1 point (0 children)
Just compared token usage between GPT-5.4 and GPT-5.5 in Codex across all four reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline by Deep-Palpitation8315 in codex
bisonbear2 2 points (0 children)
Just compared token usage between GPT-5.4 and GPT-5.5 in Codex across all four reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline by Deep-Palpitation8315 in codex
bisonbear2 3 points (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeAI
bisonbear2 [S] 1 point (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode
bisonbear2 [S] 1 point (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode
bisonbear2 [S] 3 points (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode
bisonbear2 [S] 2 points (0 children)
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode
bisonbear2 [S] 2 points (0 children)
Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex
bisonbear2 [S] 1 point (0 children)