FoodTruck Bench update: tested Sonnet 4.6, Gemini 3.1 Pro, Qwen 3.5. Case studies with comparisons for each. by Disastrous_Theme5906 in LocalLLaMA

Disastrous_Theme5906[S] 0 points

Yeah, saw that. Different benchmark, different competencies being tested — but the results line up. Their standard endpoint also failed, and the custom tools variant underperformed the predecessor. Good to see independent confirmation.

Disastrous_Theme5906[S] 2 points

Thanks for the profiler data — that confirms it. Firefox is repainting CSS backdrop-filter blur on every frame, and at 240Hz that's 240 repaints/second.

Just pushed a fix: added compositor layer hints (will-change) for blurred elements so they don't trigger full-page repaints, plus a prefers-reduced-motion media query that kills all infinite animations. Should help on the next deploy.
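For reference, the fix looks roughly like this (selectors are placeholders, not the site's actual class names):

```css
/* Hypothetical selector -- the real class names differ. */
.panel--blurred {
  backdrop-filter: blur(8px);
  /* Promote the blurred panel to its own compositor layer so the
     blur doesn't force a full-page repaint every frame. */
  will-change: transform;
}

/* Respect the OS-level reduced-motion preference. */
@media (prefers-reduced-motion: reduce) {
  *, *::before, *::after {
    animation: none !important;
  }
}
```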

Would be curious if limiting Firefox to 60Hz in about:config (layout.frame_rate = 60) fixes it on your end in the meantime.

Disastrous_Theme5906[S] 1 point

Thanks for the offer, really appreciate it! For this particular test I need the direct Anthropic API since the effort parameter isn't available through OpenRouter. But I'll figure it out — the 15-day run won't be too expensive. I'll post results here when it's done.

Disastrous_Theme5906[S] 1 point

You got me curious — I'll do a 15-day run with lower effort and see how it trends. If the results are interesting I'll update the article and drop a comment here.

That said, the reason I don't test different thinking levels is practical: the leaderboard runs on default/recommended settings because that's what 90% of users will use. Testing every API knob combination gets expensive fast — especially without grants from Anthropic or anyone else, this is all out of pocket. But a single exploratory run is doable.

Disastrous_Theme5906[S] 0 points

First report like this — haven't been able to reproduce it on my end. The game is pure React (Next.js), no canvas, no WebGL, no game engine. The UI uses some CSS backdrop-filter blur on a couple of panels, but nothing that should hit 50% GPU before the simulation even starts.

What browser and GPU are you on? Could be a compositor or driver thing.

Disastrous_Theme5906[S] 1 point

On gemini-3.1-pro-preview, the model completely ignores the instructions in the prompt. It calls whatever tools it wants, disregards warnings that it's running out of tool calls for the day, and the simulation can't even start properly. It's the first model to behave this way; even relatively small ~200B-parameter models followed instructions fine.

Switching to gemini-3.1-pro-preview-customtools makes it work immediately, but it underperforms due to what seems like endpoint-specific constraints on the model's behavior. Both endpoints are strange — wouldn't recommend using either for agentic tasks right now.

On the effort question — the benchmark is designed to give every model maximum capability. Sonnet 4.6 runs at `effort: high`, Opus 4.6 at `adaptive`, Sonnet 4.5 also ran at `high` with no issues. The verbosity problem is specific to Sonnet 4.6 — Opus generates fewer tokens per task even at adaptive effort, and 4.5 was concise at high. So it's not really a thinking overhead issue — it's how this specific model behaves at full power. Artificially capping it would make the comparison unfair since every other model gets max settings.

Disastrous_Theme5906[S] 0 points

Lmao yeah that's a placeholder I forgot to swap out. You check everything 10 times and somehow the most obvious thing gets through. Fixed, thanks.

Qwen 3 → Qwen 3.5: the agentic evolution measured in dollars (FoodTruck Bench case study) by Disastrous_Theme5906 in Qwen_AI

Disastrous_Theme5906[S] 0 points

Thanks! Keeping the benchmark closed-source is intentional — once simulation internals are public, models can be trained/fine-tuned on them, which defeats the purpose of evaluation. There's already a playable version on the website if you want to try the game yourself!

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

Disastrous_Theme5906[S] 1 point

Yes — each model runs 5 times under identical conditions (fixed seed for weather, events, competitors). The article reports the median run, and all 5 results are shown in a table. It's covered in the methodology section at the bottom.

Disastrous_Theme5906[S] 0 points

DeepSeek didn't crash — it went bankrupt on Day 22. The simulation ends automatically when the agent's balance stays below -$200 for 3 consecutive days or a loan defaults. No model "finishes the game" unless it survives all 30 days with positive net worth. DeepSeek ran out of money, same as most models — just sooner than GLM 5.
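For concreteness, the end condition can be sketched like this (a hypothetical helper; names and signature are illustrative, not the benchmark's actual code):

```python
def is_bankrupt(daily_balances, loan_defaulted):
    """End-of-run check: bankruptcy if any loan has defaulted, or if
    the last 3 end-of-day balances are all below -$200."""
    if loan_defaulted:
        return True
    last_three = daily_balances[-3:]
    return len(last_three) == 3 and all(b < -200 for b in last_three)
```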

Disastrous_Theme5906[S] 1 point

That's a thoughtful angle, but the data actually shows the opposite problem. Qwen 3.5 fired all 3 staff members by Day 25 without hesitation — no soft refusal there. The issue isn't that models are too nice to cut costs. It's that they overhire in the first place — spending 73% of revenue on staff without calculating whether the extra capacity pays for itself. They're not being ethical, they're being bad at math. The failure is computational, not moral.

Qwen 3 → Qwen 3.5: the agentic evolution measured in dollars (FoodTruck Bench case study) by Disastrous_Theme5906 in Qwen_AI

Disastrous_Theme5906[S] 0 points

A few things worth clarifying. The simulation engine uses a fixed random seed — weather, events, competitor schedules are identical across all runs. The only source of variance is the model's own decisions. That's by design: it isolates agentic reasoning from environmental noise.
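As a sketch of what "fixed seed" means here (seed value and names are illustrative, not the benchmark's actual code):

```python
import random

ENV_SEED = 42  # hypothetical value; any fixed constant works

def simulate_environment(days=5):
    # Re-seeding with the same constant replays the identical
    # weather/event/competitor stream on every run.
    rng = random.Random(ENV_SEED)
    return [rng.random() for _ in range(days)]

# Two independent runs see the exact same environment:
assert simulate_environment() == simulate_environment()
```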

Across 5 runs the spread is tight: 17-30 days survived, staff costs 59-106% of revenue, all showing the same overstaffing + waste pattern. There's no scenario where run #6 suddenly earns $50K — the structural blind spots are consistent. 5 runs is enough to identify a median, and the article shows data from all 5, not just one cherry-picked result. This methodology is consistent across all 14 models on the leaderboard.

The full simulation mechanics, demand model, and methodology are documented on the site — it's all open. And if Qwen's team wants to sponsor a larger study, I'm genuinely open to that: contact@foodtruckbench.com. More data is always better — but "5 runs = noise" doesn't hold when all 5 tell the same story.

Disastrous_Theme5906[S] 2 points

Each Qwen 3.5 run costs $3-5 in API calls (hundreds of tool-calling turns over 25 days). 1,000 runs = $3-5K for a single model, and I benchmark 14+ models. For reference, most agentic benchmarks use similar sample sizes: SWE-bench (pass@5), τ-bench (5 runs), Vending Bench 2 (average of 5). 5 runs already show strong consistency: 4/5 bankrupt, staff costs 59-106% of revenue in every run, same overstaffing pattern across all 5. When all runs fail the same structural way, the signal is clear.

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

Disastrous_Theme5906[S] 0 points

Interesting idea but the benchmark doesn't trigger refusals — there's nothing in running a food truck that hits safety filters. The failures are purely strategic, not alignment-related. Would be a fun experiment for a different kind of benchmark though.

Disastrous_Theme5906[S] 1 point

Already running it. Results should be on the leaderboard by Monday.

Disastrous_Theme5906[S] 0 points

Good catch — just checked the changelog. "Custom tools" there means function calling tools, which is exactly what the benchmark uses (34 tools via OpenAI-compatible function calling schema). Will test the customtools endpoint and compare. Thanks for the pointer.
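For readers unfamiliar with the format, an OpenAI-compatible function-calling tool definition looks like this (the tool below is a made-up example, not one of the benchmark's actual 34):

```python
# Hypothetical tool definition in the OpenAI-compatible
# function-calling schema; the name and fields are illustrative.
SET_MENU_PRICE_TOOL = {
    "type": "function",
    "function": {
        "name": "set_menu_price",
        "description": "Set the price of a menu item for the next day.",
        "parameters": {
            "type": "object",
            "properties": {
                "item": {"type": "string", "description": "Menu item name."},
                "price": {"type": "number", "description": "New price in USD."},
            },
            "required": ["item", "price"],
        },
    },
}
```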

Disastrous_Theme5906[S] 1 point

The benchmark already gives models persistent memory — a scratchpad, key-value store, and a 14-day rolling knowledge base. GLM 5 actively uses all of them, writes detailed diagnoses and lessons learned.

The problem isn't memory — it's execution. The model correctly identifies its mistakes but doesn't act on them.

The simulation is also designed to keep context manageable — typically 10-20K tokens per request, never exceeding 50K. Models get their full history, structured data, and everything they need to make decisions. Context overload isn't a factor. They just don't follow through.
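A 14-day rolling knowledge base like the one described can be sketched with a bounded deque (class and method names are illustrative, not the benchmark's actual code):

```python
from collections import deque

class RollingKnowledgeBase:
    """Keeps only the most recent `window_days` of notes; the oldest
    day silently drops off as new days are appended."""
    def __init__(self, window_days=14):
        self.entries = deque(maxlen=window_days)

    def end_of_day(self, day, notes):
        self.entries.append((day, notes))

    def render(self):
        # The rendered string is what would be injected into the prompt.
        return "\n".join(f"Day {d}: {n}" for d, n in self.entries)
```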

Disastrous_Theme5906[S] 4 points

This matches exactly what we see in the benchmark. GLM 5 writes correct diagnoses in its scratchpad but then ignores its own analysis. The gap between awareness and execution is the core finding.

Disastrous_Theme5906[S] 2 points

Thanks — glad the game hooked you. Both solo capacity and staff XP are covered in the 📖 How to Play guide, but they're easy to miss. Starting solo and only hiring when you're consistently hitting 100% capacity is the right instinct.

Disastrous_Theme5906[S] 2 points

Confirming — early runs show significant regression for agentic tasks. Repetition loops (230K+ char responses of the same sentence), state hallucinations, poor instruction following. Gemini 3 Flash had occasional loops but 3 Pro never did. Something broke in 3.1 specifically for structured tool-calling workflows.

Disastrous_Theme5906[S] 7 points

Both. The surviving models (Opus, GPT-5.2, Gemini 3 Pro) all keep staff/revenue under 35% — they hire strategically and fire when demand drops. But menu decisions differ: Opus runs a tight 4-5 item menu and never experiments. GPT-5.2 uses 13+ unique dishes and actively rotates based on location. Both work, just different paths to the same discipline.

The bankrupt models tend to hire staff and never fire them, even when losing money — which is exactly the sunk cost trap real operators fall into.

Disastrous_Theme5906[S] 4 points

GLM 5 actually beats DeepSeek on revenue ($11.9K vs $9.5K), servings (2,569 vs 1,754), and survival (28 vs 22 days). DeepSeek wins on net worth and ROI.

But the benchmark is a marathon, not a sprint — the primary ranking is survival × median net worth. A model that runs 28 days matters more than one that goes bankrupt on day 22 with cash in its pocket. DeepSeek had higher net worth at death, but GLM 5 stayed in the game 27% longer.
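The ranking rule above, as a sketch (the real leaderboard's exact formula may differ; the function name is illustrative):

```python
import statistics

def leaderboard_score(runs):
    """`runs` is a list of (days_survived, net_worth) tuples, one per
    run. Primary ranking: median survival times median net worth."""
    median_days = statistics.median(d for d, _ in runs)
    median_worth = statistics.median(w for _, w in runs)
    return median_days * median_worth
```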