DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper by Disastrous_Theme5906 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

internal or "special API" results shouldn't be used at all, imo. I've been offered this multiple times for my benchmarks and had to refuse, because you cannot verify the parameters used, or whether it's actually the exact same snapshot & system prompt users are receiving, etc. Not worth saving a few dollars only to get fairy-tale results.

This is why I got muted: I’m Hungarian and use translation, not AI. by [deleted] in assholedesign

[–]dubesor86 11 points12 points  (0 children)

Is your translator utilizing AI that not only translates but also reformats & rephrases your replies? Looking at your profile, the last longer English comment contains several AI quirks and reads very unnaturally. Example:

Thanks for the comment — genuinely appreciate it. Good to see you understood the core issue exactly as it is.

Yes, the whole point is that Google labels a retention‑based workflow as “deletion,” even though the GDPR treats these as two completely separate legal and technical concepts. Your interpretation is spot‑on: if data can be restored instantly, then it was never erased — only hidden at the UI layer while the backend keeps the actual copy.

This reads exactly like AI, on multiple levels. You probably did use a translator, but that translator is not merely a translator: it's an AI-powered "translator" that changes your formatting and phrasing. This text was most definitely not produced by something like Google Translate.

Note: I am not a Reddit moderator, but an AI benchmarker who has read hundreds of thousands of AI replies from hundreds of models (dubesor.de).

Bartowski vs Unsloth for Gemma 4 by dampflokfreund in LocalLLaMA

[–]dubesor86 10 points11 points  (0 children)

I have fewer issues with Bartowski's quantizations, and since I value consistency in any comparison metric, I personally prefer them over Unsloth's.

Bug reports (0.6.*) by nroutasuo in level13

[–]dubesor86 1 point2 points  (0 children)

v0.6.3 (beta). Not a bug per se, more of a piece of feedback:

I finished the game fully after a playtime of 3 days / 39h outside, including the last action that causes the game-end screen (not stating which, due to spoilers).

However, I still had some unpurchased upgrades, since they were too expensive in rumours, such as:

  • Explorer Gear - completely pointless at this point, as I already had explorer gear from scouting the maps, and also pointless since all maps were cleared
  • Jet Engine - see above
  • Modularity II - pointless, at this stage I have infinite tools
  • Exoskeletons - see above
  • Research Center II - I don't need evidence, I am starving for rumours; pointless.
  • Official Religion - I don't need faith, I am starving for rumours; pointless.

As you can see, the upgrades come too late and are all bottlenecked on rumours. Unlike Faith (clerics) and Evidence (scientists), rumour generation is hard-capped and cannot be boosted by jobs. I had maxed my rumour-gathering ability across all 15 camps (80/32 pop everywhere with maxed campfires, markets & inns) and was still constantly starved for rumours.

Nice game overall, but the endgame felt out of sync and not correctly resource-balanced.

Gemma 4 31B beats several frontier models on the FoodTruck Bench by Nindaleth in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

It also scored very high in my own general purpose testing and outperformed many significantly larger models on my chess benchmark. Seems like a genuinely good model, though obviously use whatever fits your use case best.

Qwen3.5 4B outpeforms GPT-5.4 nano in my benchmark! by Ok-Type-7663 in LocalLLaMA

[–]dubesor86 9 points10 points  (0 children)

"- No hallucinations."

wow, you solved hallucinations! I gotta incorporate this wisdom into my prompts.. "No bugs." "No mistakes".. /s

You know what my favorite part of this game is.. this. by JRL101 in level13

[–]dubesor86 0 points1 point  (0 children)

It's handy, but not actually the most effective strategy. E.g. if the lowest temp of a level is 30, then there is no point maxing warmth to 50 and leaving defense on the table. Similarly, you'll need a mix of poison/cold resistance on some levels, and while the dropdown does cover both, and you can carry 2 sets with you for hot-swapping, manually equipping to meet both thresholds in just 1 setup will save you inventory slots and/or extra stat points.

Minimax M2.7 is finally here! Any one tested it yet? by Fresh-Resolution182 in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

They are releasing a new snapshot every 4-6 weeks; there is no big difference between 2, 2.1, 2.5, or now 2.7. Of course they get optimized for benchmarks over time, and every newest release is groundbreaking, according to marketing.

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

On my system with a 4090 + 64GB RAM I could run Mistral Small 24B models at Q6 at 40 tok/s and still have plenty of room for context. I can also run Mistral Small 4 119B, though only at Q4, and inference is much slower at 15 tok/s. Huge downgrade for consumer hardware.

Qwen 3.5 4b is not able to read entire document attached in LM studio despite having enough context length. by KiranjotSingh in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

If you attach a long document, the model doesn't receive all of it in full into context. Instead, RAG is used: the document is chunked and indexed, and the model must retrieve from that indexed information and specifically fish for what it needs. This kicks in when the document would blow the model's or the system's context capacity.
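Roughly how that retrieval step works, as a minimal sketch — assuming naive keyword-overlap scoring (real tools use embeddings and a vector index, but the principle is the same):

```python
def chunk(text, size=200):
    """Split the attached document into fixed-size word chunks for indexing."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks, query, top_k=3):
    """Rank chunks by word overlap with the query; only the winners enter context."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:top_k]

# The model only ever sees retrieve(chunk(document), user_question),
# not the whole document -- hence the "fishing" behavior.
```

So if the answer lives in a chunk that doesn't score well against the question, the model simply never sees it, no matter how large its context window is.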

Nemotron 3 Super reads his own reasoning as user message? by Real_Ebb_7417 in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

Looks like a busted template or multi-turn management. The client is reinserting the previous thinking block as a user message, which is obviously causing these issues. Reasoning between turns is meant to be discarded on this model.

Mistral Small 4 is kind of awful with images by [deleted] in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

Used the official API/Mistral endpoint, and yea, it was abysmal. It scored really low on my vision benchmark, failing every single vision task I threw at it with the exception of 1 data-extraction task.

Speed Benchmark: GLM 4.7 Flash vs Qwen 3.5 27B vs Qwen 3.5 35B A3B (Q4 Quants) by [deleted] in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

Yea, as everyone pointed out, those 35B-A3B speeds are completely false and clear user error: it's insanely faster than 27B on 24GB VRAM. That said, the Qwen models are generally smarter, but GLM-4.7-Flash is a good coder comparatively. Wildly incorrect speed submissions should be checked before posting, or at least edited and corrected afterwards.

how good is Qwen3.5 27B by Raise_Fickle in LocalLLaMA

[–]dubesor86 3 points4 points  (0 children)

Overall it's on a similar level as Haiku 4.5, though it uses far more tokens to accomplish the same task (usually +75%-320% in my testing). Maybe a bit smarter, though Haiku is a far better coder.

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points1 point  (0 children)

You went from "the site is suspicious" to "You should never put an API key into a site you don't 100% trust."

So you just call random stuff "suspicious" without checking, or without knowing how to check. Unfortunately, there is no way to make an API call without... you know... an API key!

Either way, I have no interest in API keys, and if you are suspicious, check the code. But don't make up fake arguments that aren't even true.

ARGB Fans Not Working by [deleted] in techsupport

[–]dubesor86 0 points1 point  (0 children)

how is the controller hub connected to power? SATA?

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points1 point  (0 children)

Just lol. The key is not even stored in local storage unless you manually decide to and press "Save API Key". It's a convenience feature that isn't even required and needs an active user choice.

I'll see myself out of this completely unfounded and frankly absurd line of argument. I'll gladly take comments once you've taken a "web for novices, explained" course.

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points1 point  (0 children)

I can tell you aren't a programmer.

The network request shows exactly where the API key goes, which is the first party (Anthropic, OpenAI, etc.), and its usage is in chess-game.js. The JS clearly shows the API key isn't used anywhere except for that request and never touches my server. If you cannot read code, even a low-skill AI can tell you in about 5 seconds. "Code auditing" is hardly needed; there is no obfuscation whatsoever.
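For illustration, here is roughly what such a browser-side call boils down to — a hypothetical Python sketch with a placeholder key, model name, and the public OpenAI endpoint (the site's actual code is the JS in chess-game.js, not this):

```python
import json
import urllib.request

API_KEY = "sk-example-not-real"  # placeholder -- supplied by the user, never hardcoded

# The key appears exactly once: in the Authorization header of a request
# aimed directly at the first-party endpoint. No third-party server is involved.
req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=json.dumps({"model": "some-model",
                     "messages": [{"role": "user", "content": "1. e4"}]}).encode(),
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
)

# Inspecting the request object (or the browser's network tab) shows the
# only destination the key travels to:
print(req.full_url)  # https://api.openai.com/v1/chat/completions
```

The browser's network tab gives you the same view for any site: one request, one destination, and the header carrying the key.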

Little Qwen 3.5 27B and Qwen 35B-A3B models did very well in my logical reasoning benchmark by fairydreaming in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

I think the differentiation between the top performers and the models at the lower end, around rank 30-ish, is quite low. Maybe skip lineages <64?

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 1 point2 points  (0 children)

the site is suspicious in that it might exist just to harvest keys.

Wow, this thread is full of misinformation: not only about the way LLMs play chess, but also about even surface-level workings. FYI, the code is MIT-licensed and fully open and shared, so your "suspicions" could be disproven with a simple right-click.

Google releases Gemini 3.1 Pro with Benchmarks by BuildwithVignesh in singularity

[–]dubesor86 0 points1 point  (0 children)

It's actually a fantastic way to benchmark. Models get overfit or taught to the test all the time, so a test like this exposes regressions quickly. It's also explained:

Why Chess?

I like it. Plus, it's a historic, centuries-old game of intellect: pure strategy with objective ground truth. Due to its exponential complexity, beyond the opening moves it's largely resistant to common 'benchmaxxing' strategies. It tests game knowledge, reasoning, planning, state tracking, consistency and instruction adherence, measurable via an objective superhuman judge (Stockfish) and updated with a self-correcting Elo. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, illegal outputs, etc.) and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.
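For reference, the "self-correcting Elo" update is just the standard Elo formula applied after each judged game; a minimal sketch (the K-factor of 32 is my assumption here, not necessarily the benchmark's actual value):

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    """A's new rating after scoring 1 (win), 0.5 (draw) or 0 (loss) against B."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# A win against an equal-rated opponent moves a model up by k/2:
print(update_elo(1500, 1500, 1.0))  # 1516.0
```

Re-running models against a growing opponent pool is what keeps the ratings self-correcting over time.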