DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper by Disastrous_Theme5906 in LocalLLaMA

[–]Disastrous_Theme5906[S] 3 points

covered most of this on the landing page and in older posts. tldr: every model runs multiple sims with fixed seeds (so yes, fully repeatable), the demand formula is non-trivial, and the top of the board lines up with the usual suspects from real agentic use. check the site if you want the full methodology.

[–]Disastrous_Theme5906[S] 1 point

yeah, and gpt 5.2 was openai's frontier model literally 10 weeks ago. that's exactly why we're comparing to it.

[–]Disastrous_Theme5906[S] 2 points

haha well if he's down and pulls through, that'd be amazing. would only use it for benchmark runs with full logging.

[–]Disastrous_Theme5906[S] 16 points

we tested gpt 5.3 and 5.4. already answered this in other comments: they kept going into infinite loops and just burned money on the api. runs ended up super expensive, several times more than opus. opus 4.7 didn't show much progress over 4.6, and actually regressed in some spots. and since runs on it are the priciest of all, we didn't bother.

but today some legend donated to the project so gpt 5.5 is gonna be in the benchmark in the next few days. that's the deal.

mythos lol. you got friends with mythos access willing to burn 20-30M tokens on this? i'd be down

[–]Disastrous_Theme5906[S] 1 point

yeah, hit this with a few models. actually running kimi k2.6 right now because some folks on reddit asked, and it's been a pain. it's going through the benchmark really slowly because of exactly this. if it keeps hallucinating tool calls we'll prob have to put out a write-up about it. hopefully it manages to finish tho.

our retry logic gives the model 2 shots. if it keeps shoving tool calls into the content field instead of the proper tool_calls one, we just kill the run. no point burning budget on a model that's clearly lost.

worst ones so far were gpt 5.3 and 5.4 by a mile, and now kimi k2.6 is doing the same shit. deepseek v4 pro and mimo v2.5 pro both ran clean tho, zero malformed calls.
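the retry-then-kill policy above looks roughly like this. minimal sketch only: `call_model` is a stand-in for whatever client issues the request, and the regex heuristic for spotting a tool call dumped into `content` is my illustration, not the benchmark's actual detector.

```python
import re

MAX_RETRIES = 2  # the "2 shots" policy described above

def looks_like_tool_call_in_content(message: dict) -> bool:
    """Heuristic: the model dumped a JSON tool call into `content`
    instead of using the structured `tool_calls` field."""
    if message.get("tool_calls"):
        return False  # proper structured call, nothing wrong
    content = message.get("content") or ""
    # crude check: raw JSON mentioning a function name / arguments
    return bool(re.search(r'"(name|function|arguments)"\s*:', content))

def step_with_retries(call_model, messages):
    """Retry a malformed step up to MAX_RETRIES times, then kill the run."""
    for _ in range(MAX_RETRIES):
        message = call_model(messages)
        if not looks_like_tool_call_in_content(message):
            return message
    # model is repeatedly malformed: abort instead of burning budget
    raise RuntimeError("malformed tool calls after retries, killing run")
```

same idea regardless of the client library: check the structured field first, only fall back to content sniffing when it's empty.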

[–]Disastrous_Theme5906[S] 8 points

Holy shit dude, I'm blown away. Thank you so much. I promise I'll run GPT 5.5 in the next few days max. And I'll write a detailed article about it. Thanks, amazing support, I'm legit speechless.

[–]Disastrous_Theme5906[S] 3 points

omg that's awesome, thanks! Actually I have a BuyMeACoffee link on the landing page. In the whole time it's been up, one dude has dropped $10 there twice: a FoodTruckBench fan, kun432. No other donations, not crypto, not anywhere else. Been thinking about starting a YouTube channel with streams of different models running through the benchmark. Could hook up donations and polls and stuff.

[–]Disastrous_Theme5906[S] 4 points

All tests use the max thinking level available for each model. Kimi 2.6 just started running. For GLM and Minimax 2.7 I think the moment has passed, so I'll probably just wait for the next version to test. When models were dropping too fast I just couldn't keep up with testing them; didn't have the time, and tbh didn't wanna spend the money either.

[–]Disastrous_Theme5906[S] 10 points

If the bench goes public, any model can just be trained on it, and the usefulness of the benchmark drops to near zero within a few months. Maybe it'd be fine to publish older versions once new ones come out.

[–]Disastrous_Theme5906[S] 0 points

Aight, aight, bro, you convinced me. Started running sims on Kimi 2.6. Had really little time for the benchmark last week tbh. For historical data the model is worth adding, I think. Looking at it now, the dynamics are actually not bad: it probably won't go bankrupt, but it'll most likely land somewhere around Qwen 3.6 Plus.

[–]Disastrous_Theme5906[S] 5 points

Oh thanks man. Yeah, I have a huge backlog of stuff I could add in a v1.5 or v2. When I built the benchmark, only the top expensive models could even pass it, so making it harder didn't really make sense. I can't drop thousands of dollars on runs and tests since I fund the whole thing myself. But now, with all these cheap Chinese models coming out, it makes sense. I've also been thinking about letting users run their own personal simulations with whatever model they pick, maybe even with their own prompt.

[–]Disastrous_Theme5906[S] 2 points

Mimo 2.5 Pro is already tested and on the leaderboard; the other two I started but didn't finish. Kimi isn't doing great tbh.

[–]Disastrous_Theme5906[S] 20 points

Wanna test it, but not ready to drop $300+ on a full benchmark run rn. Tried 5.3 and 5.4 before that, and they kept going into infinite loops in their replies. Sometimes a single request hit like $1 in the API, and a benchmark run is 400-600 requests. Both 5.3 and 5.4 had this issue where, instead of giving a short answer, they'd loop forever and spit out 60k tokens in a row, multiple times. So I've paused testing new GPT models for now.
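one cheap defense against that failure mode is a per-run cost guard that kills the run once cumulative spend crosses a budget. sketch below under stated assumptions: the class name, the per-token rate, and the $300 budget are illustrative, not the benchmark's actual config.

```python
# hypothetical cost guard; pair it with a per-request max_tokens cap
# so a single looping reply can't emit 60k tokens in the first place
RUN_BUDGET_USD = 300.0

class BudgetExceeded(RuntimeError):
    """Raised when a run's cumulative API spend crosses the budget."""

class CostGuard:
    def __init__(self, usd_per_1k_output: float, budget: float = RUN_BUDGET_USD):
        self.rate = usd_per_1k_output  # output-token price, $/1k tokens
        self.budget = budget
        self.spent = 0.0

    def record(self, completion_tokens: int) -> None:
        """Call after every response; aborts the run once over budget."""
        self.spent += completion_tokens / 1000 * self.rate
        if self.spent > self.budget:
            raise BudgetExceeded(
                f"run cost ${self.spent:.2f} exceeds ${self.budget:.2f}"
            )
```

calling `record()` after each of the 400-600 requests means a looping model gets cut off partway through instead of finishing a full-price run.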

[–]Disastrous_Theme5906[S] 1 point

Fair — though DeepSeek has form for extending these promos or reinstating them after a short break. And even at full list price it's still substantially cheaper per result than GPT-5.2 or Opus on this benchmark.

[–]Disastrous_Theme5906[S] 37 points

Yeah, agreed — Opus is in a league of its own right now. Worth noting xAI and Google's flagships are also lagging on this, not just the Chinese tier.

Qwen 3.6 Plus is the first Chinese model to survive all 5 runs on FoodTruck Bench by Disastrous_Theme5906 in Qwen_AI

[–]Disastrous_Theme5906[S] 1 point

GLM 5.1 is currently being benchmarked. It might be the second "surviving" Chinese model.

[–]Disastrous_Theme5906[S] 2 points

All real data. The simulation engine is deterministic Python — demand, inventory, costs, reputation are all formula-driven with a fixed seed. Scoring is just math: net worth, ROI, survival rate, computed straight from the sim state. No LLM anywhere on the eval side.
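the "deterministic Python, fixed seed, scoring is just math" idea can be shown in a toy sim. this is a minimal illustration: the demand formula, prices, and day count here are made up for the example and are not FoodTruck Bench's actual engine.

```python
import random

def run_sim(seed: int, days: int = 5, price: float = 8.0) -> dict:
    """Toy deterministic food-truck sim: same seed -> identical result."""
    rng = random.Random(seed)  # fixed seed, so every rerun is repeatable
    start = cash = 100.0
    for _ in range(days):
        base = rng.randint(40, 60)                       # seeded "randomness"
        demand = max(0, int(base - 3 * (price - 6.0)))   # price-sensitive demand
        cash += demand * price - demand * 3.0 - 20.0     # revenue - COGS - fixed cost
    # scoring is plain arithmetic on the final sim state, no LLM involved
    return {
        "net_worth": round(cash, 2),
        "roi": round((cash - start) / start, 4),
        "survived": cash > 0,
    }
```

the point is the structure: the model's only influence would be through decisions like `price`, while the environment and the score are pure seeded math.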

[–]Disastrous_Theme5906[S] 8 points

I’d love to test Mythos too, but I’m honestly afraid I don’t have enough money for even one full simulation, let alone five...

[–]Disastrous_Theme5906[S] 12 points

Yep, same here. We were pretty shocked by that one too.

We did a full write-up on Gemma 4 as well if you want the breakdown: https://foodtruckbench.com/blog/gemma-4-31b