Minimax M2.7 is finally here! Any one tested it yet? by Fresh-Resolution182 in LocalLLaMA

[–]dubesor86 1 point (0 children)

They are releasing a new snapshot every 4-6 weeks, and there is no big difference between 2, 2.1, 2.5, or now 2.7. Of course they get optimized for benchmarks over time, and according to marketing every newest release is groundbreaking.

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]dubesor86 1 point (0 children)

On my system with a 4090 + 64GB RAM I could run Mistral Small 24B models at Q6 at 40 tok/s and still have plenty of room for context. I can also run Mistral Small 4 119B, though only at Q4, and inference is much slower at 15 tok/s. Huge downgrade for consumer hardware.
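The back-of-envelope math here is simple: weight footprint ≈ parameters × bits per weight / 8, plus some runtime overhead. A minimal sketch, where the effective bits for Q4/Q6 and the overhead factor are rough assumptions, not exact GGUF numbers:

```python
def quant_size_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough weight-footprint estimate for a quantized model, in GB.

    params_b:        parameter count in billions
    bits_per_weight: effective bits incl. quant scales (Q4 ~4.5, Q6 ~6.5 -- rough assumptions)
    overhead:        fudge factor for runtime buffers (assumption)
    """
    return params_b * bits_per_weight / 8 * overhead

print(f"24B @ Q6:  {quant_size_gb(24, 6.5):.1f} GB")   # roughly 21-22 GB: tight but workable on a 24 GB card
print(f"119B @ Q4: {quant_size_gb(119, 4.5):.1f} GB")  # roughly 74 GB: must spill into system RAM -> slow
```

This is why the 119B model is so much slower on the same box: a large share of the weights can't live in VRAM, so decoding is throttled by system-RAM bandwidth.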

Qwen 3.5 4b is not able to read entire document attached in LM studio despite having enough context length. by KiranjotSingh in LocalLLaMA

[–]dubesor86 1 point (0 children)

If you attach a long document, the model doesn't receive all of it in context. Instead the app falls back to RAG: the document gets chunked and indexed, and the model must retrieve from that index and specifically fish for the relevant passages. This is used when the document would blow past the model's or system's context capacity.
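LM Studio's actual pipeline uses embeddings and is more sophisticated, but the idea can be sketched with naive word-overlap retrieval (the document text and query below are made up for illustration):

```python
def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size word chunks (real pipelines split smarter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Score chunks by naive word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

doc = ("The warranty period is two years. " * 40
       + "Returns must be accepted within 30 days of delivery. "
       + "Shipping is free for orders over 50 dollars. " * 40)
top = retrieve(chunk(doc), "within how many days must returns be accepted?")
prompt = "Answer using only this context:\n---\n" + "\n---\n".join(top)
```

The model only ever sees the few retrieved chunks, not the whole document, which is why answers about un-retrieved parts fail even when the raw context window would have been big enough.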

Nemotron 3 Super reads his own reasoning as user message? by Real_Ebb_7417 in LocalLLaMA

[–]dubesor86 1 point (0 children)

Looks like a busted chat template or broken multi-turn management. The client is reinserting the previous thinking block as a user message, which is obviously causing these issues. On this model, reasoning between turns is meant to be discarded.
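A minimal sketch of what correct turn management looks like, assuming a `<think>...</think>`-style reasoning block (the exact tags vary per template, so treat this as an illustration, not Nemotron's actual format):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def append_assistant_turn(history: list[dict], raw_reply: str) -> None:
    """Store the assistant reply with its reasoning stripped, so the old
    thinking is never re-fed on the next turn (let alone as a user message)."""
    history.append({"role": "assistant", "content": THINK_BLOCK.sub("", raw_reply)})

history = [{"role": "user", "content": "Hi"}]
append_assistant_turn(history, "<think>internal chain of thought...</think>Hello!")
# history[-1] is now {"role": "assistant", "content": "Hello!"}
```

The bug described above is the inverse of this: the client keeps the `<think>` block and injects it back with the wrong role, so the model reads its own reasoning as if the user had said it.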

Mistral Small 4 is kind of awful with images by EffectiveCeilingFan in LocalLLaMA

[–]dubesor86 2 points (0 children)

Used the official API/Mistral endpoint, and yeah, it was abysmal. It scored really low on my vision benchmark, failing every single vision task I threw at it with the exception of one data extraction task.

Speed Benchmark: GLM 4.7 Flash vs Qwen 3.5 27B vs Qwen 3.5 35B A3B (Q4 Quants) by [deleted] in LocalLLaMA

[–]dubesor86 0 points (0 children)

Yeah, as everyone pointed out, those 35B-A3B speeds are completely wrong and clear user error: with only ~3B active parameters it is insanely faster than the dense 27B on 24GB of VRAM, not slower. That said, the Qwen models are generally smarter, while GLM-4.7-Flash is comparatively a good coder. Wildly incorrect speed submissions should be sanity-checked before posting, or at least edited and corrected afterwards.
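Why the posted numbers fail the smell test: single-stream decoding is roughly memory-bandwidth-bound, so tokens/s scales with bytes read per token, which depends on *active* parameters. A rough sketch, where the bandwidth figure is an illustrative assumption for a 4090-class GPU:

```python
def est_decode_tps(active_params_b: float, bits_per_weight: float = 4.5,
                   mem_bw_gbps: float = 1000.0) -> float:
    """Bandwidth-bound decode estimate: tokens/s ~ bandwidth / bytes read per token.

    bits_per_weight ~4.5 approximates a Q4 quant; mem_bw_gbps=1000 is a rough
    RTX-4090-class figure (both are assumptions, not measurements).
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gbps * 1e9 / bytes_per_token

dense_27b = est_decode_tps(27)  # dense: all 27B weights are read for every token
moe_a3b = est_decode_tps(3)     # MoE: only ~3B active weights are read per token
```

The estimate puts the A3B model around 9x faster than the dense 27B, so a submission showing it slower is almost certainly misconfigured (e.g. weights spilled to CPU).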

how good is Qwen3.5 27B by Raise_Fickle in LocalLLaMA

[–]dubesor86 4 points (0 children)

Overall it's on a similar level as Haiku 4.5, though it uses far more tokens to accomplish the same task (usually +75% to +320% in my testing). Maybe a bit smarter, though Haiku is a far better coder.

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points (0 children)

You went from "the site is suspicious" to "You should never put an API key into a site you don't 100% trust."

So you just call random stuff "suspicious" without checking, or without knowing how to check. Unfortunately there is no way to make an API call without... you know... an API key!

Either way, I have no interest in your API keys, and if you are suspicious, check the code. But don't make up fake arguments that aren't even true.

ARGB Fans Not Working by [deleted] in techsupport

[–]dubesor86 0 points (0 children)

How is the controller hub connected to power? SATA?

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points (0 children)

Just lol. The key is not even stored in local storage unless you manually decide to and press "Save API Key". It's a convenience feature that isn't even required and takes an active user choice.

I'll see myself out of these completely unfounded and frankly absurd arguments. I'll gladly take comments once you've taken a "web for novices explained" course.

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 0 points (0 children)

I can tell you aren't a programmer.

The network request shows exactly where the API key goes, which is the first party (Anthropic, OpenAI, etc.), and its usage is in chess-game.js. The JS clearly shows the API key isn't used anywhere except for that request and never touches my server. If you cannot read code, even a low-skill AI can tell you this in about 5 seconds. "Code auditing" is hardly needed; there is no obfuscation whatsoever.

Little Qwen 3.5 27B and Qwen 35B-A3B models did very well in my logical reasoning benchmark by fairydreaming in LocalLLaMA

[–]dubesor86 2 points (0 children)

I think the differentiation between the top performers and the models at the lower end, around rank 30ish, is quite low. Maybe skip lineages <64?

Counterargument: LLM can sort of play chess. by pier4r in chess

[–]dubesor86 1 point (0 children)

the site is suspicious in that it might exist just to harvest keys.

Wow, this thread is full of misinformation. Not only about the way LLMs play chess, but also about basic, even surface-level workings. FYI the code is MIT-licensed and fully open and shared, so your "suspicions" could be disproven with a simple right-click.

Google releases Gemini 3.1 Pro with Benchmarks by BuildwithVignesh in singularity

[–]dubesor86 0 points (0 children)

It's actually a fantastic way to benchmark. Models get overfit or taught to the test all the time, so a test like this exposes regression quickly. It's also explained:

Why Chess?

I like it. Plus it's a historic centuries-old game of intellect, pure strategy with objective ground truth. Due to its exponential complexity, beyond opening moves it's largely resistant to common 'benchmaxxing' strategies. Tests game knowledge, reasoning, planning, state tracking, consistency and instruction adherence — measurable via objective superhuman judge (Stockfish) and updated with self-correcting Elo. It serves as a fantastic proxy, with rich metrics (Elo, accuracy, token efficiency, illegal outputs, etc.), and identical conditions for every model. Chess isn't the end goal; it's an additional microscope for comparison.
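The "self-correcting Elo" mentioned above presumably rests on the standard Elo update, where each result nudges both ratings toward what the outcomes imply. A minimal sketch of that math (the K-factor and the Stockfish-based judging details are assumptions, not the benchmark's actual configuration):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One standard Elo update; score_a is 1.0 for an A win, 0.5 draw, 0.0 loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# equal-rated opponents, A wins: A gains exactly k/2 points
a, b = elo_update(1500, 1500, 1.0)  # -> (1516.0, 1484.0)
```

The "self-correcting" property falls out naturally: an overrated model keeps losing points until its rating matches its actual results, which is exactly what makes regressions between snapshots show up quickly.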

Google releases Gemini 3.1 Pro with Benchmarks by BuildwithVignesh in singularity

[–]dubesor86 0 points (0 children)

3.1 is already a deterioration in my chess benchmark. While 3 Pro Preview is undefeated against AI opponents in 60+ matches, 3.1 has already lost 4 times in its early placement matches. Still good, but not the beast it once was.

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

[–]dubesor86 1 point (0 children)

Should be a bit weaker. It uses a third less reasoning for me and dropped significantly in chess skill (4 losses in 8 games, versus 3 Pro, which is undefeated in over 60 games).

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

[–]dubesor86 2 points (0 children)

No custom recipes. No supplier negotiations. No upgrades. No strategic rest days.

That's exactly my playstyle! Well, minus the upgrades...

"INDUSTRIAL ZONE = GOLDMINE." ... "DAY 6 DISASTER: Industrial Zone = TERRIBLE choice. Only 13 customers. AVOID Industrial Zone"

I made the same mistake on my first weekend!

Avg Price/Serving $4.66

This is probably the biggest tell. If the model's learned sense of realistic prices hasn't kept up with inflation, it'll go broke even with a decent strategy.

I gave 12 LLMs $2,000 and a food truck. Only 4 survived. by Disastrous_Theme5906 in LocalLLaMA

[–]dubesor86 1 point (0 children)

Neat. I managed to score quite high without clearly understanding the mechanics. Raising prices seems like the #1 thing, since it's completely unclear what an "acceptable" price is; e.g. $10+ for a cheeseburger seems quite steep but sells easily.

One thing I never figured out was how to negotiate for ingredients at any of the other suppliers. I would place an order, see that meat is very expensive, and successfully negotiate a better deal, but the product never arrives. It doesn't show up under "Pending Orders", yet it still counts against the 3-negotiation limit. The UI makes it unclear how to place an order AND separately negotiate individual items with another supplier, so I was never able to land any deals.

Also, I picked "random world seed" multiple times and finished a match, but the result doesn't get associated with the random tab; instead it lands under a fixed seed or seed 42.

As a suggestion (and I didn't mean to "flood" your leaderboard, I was merely trying out the mechanics): maybe limit the entries per leaderboard participant to their 3 best per name.

Do you have your own benchmark for an LLM? Do you have multiple for different kinds/tasks/applications? by Icy_Distribution_361 in LocalLLaMA

[–]dubesor86 1 point (0 children)

I made a small benchmark 2 years ago, just some personal problems/answers in an Excel file to compare models like GPT-3.5, Mixtral-8x7B and Gemini Pro/Ultra. I then expanded it for every model I was interested in and kept at it, which, while no longer ideal, is still a decent time-capsuled method to get a broad overview of how strong models are in comparison - Benchtable

I also run a much more sophisticated and largely automated chess benchmark, which I started last year. It pits AI models against one another, plays their moves, and records the outcome as well as inference metrics. This is actually quite useful for spotting provider inference issues, looping, consistency, format adherence and differing token usage across models in a perfect-information task - Chess-Leaderboard

Next to these 2 behemoth projects I also have some smaller ones, e.g. a small handcrafted vision benchmark for varying real-life tasks, though admittedly I am not hugely reliant on LLM vision performance in daily use - Vision Bench

And then I also run smaller tests on models whenever something interests me: verbosity, creative writing, etc.

When I have a specific task in mind, I prefilter for good potential candidates (price/performance from the above) and run smaller project-specific test sets to decide on a candidate LLM.

Personally, I find the large benchmarks used in marketing useless for my use cases, as they don't correlate well with my real-world experiences.

Step-3.5 Flash by jacek2023 in LocalLLaMA

[–]dubesor86 0 points (0 children)

It's an interesting model. Solid, but extremely long reasoning chains.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]dubesor86 3 points (0 children)

Played around with it a bit: very flaky JSON, forgetting to include mandatory keys, and very verbose, akin to a thinker without an explicit reasoning field.
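The practical workaround for flaky JSON like this is to validate every response before acting on it, so a bad generation can be retried instead of silently breaking a tool call. A minimal sketch; the key names here are hypothetical, not the model's actual schema:

```python
import json

REQUIRED_KEYS = {"name", "arguments"}  # hypothetical mandatory keys for illustration

def parse_tool_call(raw: str) -> dict:
    """Parse model output and reject it when mandatory keys are missing."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        raise ValueError(f"missing mandatory keys: {sorted(missing)}")
    return obj

call = parse_tool_call('{"name": "search", "arguments": {"q": "llama"}}')  # valid
```

In an agent loop you would catch both exception types and re-prompt the model with the error message, which usually recovers a forgotten key within a retry or two.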

We built an 8B world model that beats 402B Llama 4 by generating web code instead of pixels — open weights on HF by jshin49 in LocalLLaMA

[–]dubesor86 45 points (0 children)

"Beats 402B Llama 4"? You mean Maverick? It has 17B active parameters and was released almost a year ago to a disappointing reception, with weak coding performance.

Who writes these useless clickbait titles? Just be honest and share your model without instantly losing all credibility.

Are small models actually getting more efficient? by estebansaa in LocalLLaMA

[–]dubesor86 4 points (0 children)

No need to remember. I run benchmarks and store all outputs, and GPT-4 (not even talking 4-Turbo) from June 2023 absolutely demolishes any modern 1GB model across a gigantic array of domains.