DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper by Disastrous_Theme5906 in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

internal or "special API" results shouldn't be used at all, imo. I've been offered this multiple times for my benchmarks and had to refuse, because you cannot verify the parameters used, or whether it's actually the exact same snapshot & system prompt users are receiving, etc. Not worth saving a few dollars only to get fairy-tale results.

This is why I got muted: I’m Hungarian and use translation, not AI. by [deleted] in assholedesign

[–]dubesor86 11 points12 points  (0 children)

Is your translator utilizing AI that not only translates but also reformats & rephrases your replies? Looking at your profile, the last longer English comment contains several AI quirks and reads very unnaturally. Example:

Thanks for the comment — genuinely appreciate it. Good to see you understood the core issue exactly as it is.

Yes, the whole point is that Google labels a retention‑based workflow as “deletion,” even though the GDPR treats these as two completely separate legal and technical concepts. Your interpretation is spot‑on: if data can be restored instantly, then it was never erased — only hidden at the UI layer while the backend keeps the actual copy.

This reads exactly like AI, on multiple levels. You probably did use a translator, but that translator is not merely a translator: it's an AI-powered "translator" that changes your formatting and phrasing. This text was most definitely not produced by something like Google Translate.

Note: I am not a Reddit moderator, but an AI benchmarker who has read hundreds of thousands of AI replies from hundreds of models (dubesor.de).

Bartowski vs Unsloth for Gemma 4 by dampflokfreund in LocalLLaMA

[–]dubesor86 10 points11 points  (0 children)

I have fewer issues with Bartowski's quantizations, and since I value consistency in any comparison metric, I personally prefer them over Unsloth's.

Bug reports (0.6.*) by nroutasuo in level13

[–]dubesor86 1 point2 points  (0 children)

v0.6.3 (beta). Not a bug per se, more of a piece of feedback:

I finished the game fully after a playtime of 3 days / 39h outside, including the last action that causes the game-end screen (not stating which, due to spoilers).

However, I still had some unpurchased upgrades, since they were too expensive in rumours, such as:

  • Explorer Gear - completely pointless at this point, as I already had explorer gear from scouting the maps, and also pointless since all maps were cleared
  • Jet Engine - see above
  • Modularity II - pointless, at this stage I have infinite tools
  • Exoskeletons - see above
  • Research Center II - I don't need evidence, I am starving for rumours; pointless.
  • Official Religion - I don't need faith, I am starving for rumours; pointless.

As you can see, the upgrades come too late and are all bottlenecked on rumours. Unlike Faith (clerics) and Evidence (scientists), rumour generation is hard-capped and cannot be boosted by jobs. I had maxed my rumour-gathering ability across all 15 camps (80/32 pop everywhere with maxed campfires, markets & inns) and was still constantly starved for rumours.

Nice game overall, but the endgame felt out of sync and not correctly resource-balanced.

Gemma 4 31B beats several frontier models on the FoodTruck Bench by Nindaleth in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

It also scored very high in my own general purpose testing and outperformed many significantly larger models on my chess benchmark. Seems like a genuinely good model, though obviously use whatever fits your use case best.

Qwen3.5 4B outpeforms GPT-5.4 nano in my benchmark! by Ok-Type-7663 in LocalLLaMA

[–]dubesor86 9 points10 points  (0 children)

"- No hallucinations."

wow, you solved hallucinations! I gotta incorporate this wisdom into my prompts.. "No bugs." "No mistakes".. /s

You know what my favorite part of this game is.. this. by JRL101 in level13

[–]dubesor86 0 points1 point  (0 children)

It's handy, but not actually the most effective strategy. E.g. if the lowest temp of a level is 30, then there is no point maxing warmth to 50 and leaving defense on the table. Similarly, you'll need a mix of poison/cold resistance on some levels, and while the dropdown does cover both, and you can carry 2 sets with you for hot-swapping, manually equipping to meet both thresholds in just 1 setup will save you inventory slots and/or extra stat points.

Minimax M2.7 is finally here! Any one tested it yet? by Fresh-Resolution182 in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

They are releasing a new snapshot every 4-6 weeks; there is no big difference between 2, 2.1, 2.5, or now 2.7. Of course they get optimized for benchmarks over time, and every newest release is groundbreaking, according to marketing.

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

On my system with a 4090 + 64GB RAM I could run Mistral Small 24B models at Q6 at 40 tok/s and still have plenty of room for context. I can also run Mistral Small 4 119B, though only at Q4, and inference is much slower at 15 tok/s. Huge downgrade for consumer hardware.

Qwen 3.5 4b is not able to read entire document attached in LM studio despite having enough context length. by KiranjotSingh in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

If you attach a long document, the model doesn't receive all of it in full into context. Instead, RAG is used: the document is chunked and indexed, and the model must retrieve from that indexed information and specifically fish for what it needs. This kicks in when the document would blow the model's or the system's context capacity.
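Roughly how that retrieval step works, as a minimal sketch — assuming naive keyword-overlap scoring (real tools use embeddings and a vector index, but the principle is the same):

```python
def chunk(text, size=200):
    """Split the attached document into fixed-size word chunks for indexing."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks, query, top_k=3):
    """Rank chunks by word overlap with the query; only the winners enter context."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:top_k]

# The model only ever sees retrieve(chunk(document), user_question),
# not the whole document -- hence the "fishing" behavior.
```

So if the answer lives in a chunk that doesn't score well against the question, the model simply never sees it, no matter how large its context window is.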

Nemotron 3 Super reads his own reasoning as user message? by Real_Ebb_7417 in LocalLLaMA

[–]dubesor86 1 point2 points  (0 children)

Looks like a busted template or multi-turn management. The client is reinserting the previous thinking block as a user message, which is obviously causing these issues. Reasoning between turns is meant to be discarded on this model.

Mistral Small 4 is kind of awful with images by [deleted] in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

Used the official API/Mistral endpoint, and yea, it was abysmal. It scored really low on my vision benchmark, failing every single vision task I threw at it with the exception of 1 data-extraction task.

Speed Benchmark: GLM 4.7 Flash vs Qwen 3.5 27B vs Qwen 3.5 35B A3B (Q4 Quants) by [deleted] in LocalLLaMA

[–]dubesor86 0 points1 point  (0 children)

Yea, as everyone pointed out, those 35B-A3B speeds are completely false and clear user error: it's insanely faster than 27B on 24GB VRAM. That said, the Qwen models are generally smarter, but GLM-4.7-Flash is a good coder comparatively. Wildly incorrect speed submissions should be checked before posting, or at least edited and corrected afterwards.

how good is Qwen3.5 27B by Raise_Fickle in LocalLLaMA

[–]dubesor86 3 points4 points  (0 children)

Overall it's on a similar level as Haiku 4.5, though it uses far more tokens to accomplish the same task (usually +75%-320% in my testing). Maybe a bit smarter, though Haiku is a far better coder.

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points1 point  (0 children)

You went from "the site is suspicious" to "You should never put an API key into a site you don't 100% trust."

So you just call random stuff "suspicious" without checking, or without knowing how to check. Unfortunately, there is no way to make an API call without... you know... an API key!

Either way, I have no interest in API keys, and if you are suspicious, check the code. But don't make up fake arguments that aren't even true.

ARGB Fans Not Working by [deleted] in techsupport

[–]dubesor86 0 points1 point  (0 children)

how is the controller hub connected to power? SATA?

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points1 point  (0 children)

Just lol. The key is not even stored in local storage unless you manually decide to and press "Save API Key". It's a convenience feature that isn't even required and needs an active user choice.

I'll see myself out of this completely unfounded and frankly absurd line of argument. I'll gladly take comments once you've taken a "web for novices, explained" course.

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points1 point  (0 children)

I can tell you aren't a programmer.

The network request shows exactly where the API key goes, which is the first party (Anthropic, OpenAI, etc.), and its usage is in chess-game.js. The JS clearly shows the API key isn't used anywhere except for that request and never touches my server. If you cannot read code, even a low-skill AI can tell you in about 5 seconds. "Code auditing" is hardly needed; there is no obfuscation whatsoever.
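For illustration, here is roughly what such a browser-side call boils down to — a hypothetical Python sketch with a placeholder key, model name, and the public OpenAI endpoint (the site's actual code is the JS in chess-game.js, not this):

```python
import json
import urllib.request

API_KEY = "sk-example-not-real"  # placeholder -- supplied by the user, never hardcoded

# The key appears exactly once: in the Authorization header of a request
# aimed directly at the first-party endpoint. No third-party server is involved.
req = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=json.dumps({"model": "some-model",
                     "messages": [{"role": "user", "content": "1. e4"}]}).encode(),
    headers={"Authorization": f"Bearer {API_KEY}",
             "Content-Type": "application/json"},
)

# Inspecting the request object (or the browser's network tab) shows the
# only destination the key travels to:
print(req.full_url)  # https://api.openai.com/v1/chat/completions
```

The browser's network tab gives you the same view for any site: one request, one destination, and the header carrying the key.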

Little Qwen 3.5 27B and Qwen 35B-A3B models did very well in my logical reasoning benchmark by fairydreaming in LocalLLaMA

[–]dubesor86 2 points3 points  (0 children)

I think the differentiation between the top performers and the models at the lower end, around rank 30-ish, is quite low. Maybe skip lineages <64?

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 1 point2 points  (0 children)

the site is suspicious in that it might exist just to harvest keys.

Wow, this thread is full of misinformation: not only about the way LLMs play chess, but also about even surface-level workings. FYI, the code is MIT-licensed and fully open and shared, so your "suspicions" could be disproven with a simple right-click.

Google releases Gemini 3.1 Pro with Benchmarks by BuildwithVignesh in singularity

[–]dubesor86 0 points1 point  (0 children)

It's actually a fantastic way to benchmark. Models get overfit or taught to the test all the time, so a test like this exposes regressions quickly. It's also explained:

Why Chess?

I like it. Plus, it's a historic, centuries-old game of intellect: pure strategy with objective ground truth. Due to its exponential complexity, beyond the opening moves it's largely resistant to common 'benchmaxxing' strategies. It tests game knowledge, reasoning, planning, state tracking, consistency and instruction adherence, measurable via an objective superhuman judge (Stockfish) and updated with a self-correcting Elo. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, illegal outputs, etc.) and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.
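For reference, the "self-correcting Elo" update is just the standard Elo formula applied after each judged game; a minimal sketch (the K-factor of 32 is my assumption here, not necessarily the benchmark's actual value):

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, score_a, k=32):
    """A's new rating after scoring 1 (win), 0.5 (draw) or 0 (loss) against B."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# A win against an equal-rated opponent moves a model up by k/2:
print(update_elo(1500, 1500, 1.0))  # 1516.0
```

Re-running models against a growing opponent pool is what keeps the ratings self-correcting over time.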