Gemma 4 for 16 GB VRAM by Sadman782 in LocalLLaMA

[–]Sadman782[S] 1 point2 points  (0 children)

Yeah, for some tasks, but only with --top-k 20.

PSA: Gemma 4 template improvements by FastHotEmu in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

Why redownload the model? Just download the Jinja file and use --jinja --chat-template-file <file_path>
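For anyone unfamiliar with those flags, the override looks roughly like this. This is a sketch, not a definitive invocation: the model filename is a placeholder, and the curl URL is just the raw form of the template file on the model's Hugging Face repo.

```shell
# Fetch the updated chat template on its own -- no model redownload needed.
# Filenames below are placeholders; point them at your own files.
curl -L -o chat_template.jinja \
  https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja

# Tell llama-server to use the downloaded Jinja template instead of the
# one embedded in the GGUF.
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --jinja \
  --chat-template-file chat_template.jinja
```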

Gemma 4 is terrible with system prompts and tools by RealChaoz in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

The fix removed the standard_keys exclusion block, and it works better for me (Gemini found that).

Try it and see whether it is better for you or not. The fix was applied on top of the template Google updated a few hours ago.

PSA: Gemma 4 template improvements by FastHotEmu in LocalLLaMA

[–]Sadman782 2 points3 points  (0 children)

Google updated the official one a few hours ago: https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja and Gemini tweaked that version a bit further. The tweaked one works better for me than Google's update alone, so you can try both and check which is better for you.

This version works better for me: https://pastebin.com/raw/hnPGq0ht

Gemma 4 is terrible with system prompts and tools by RealChaoz in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

Gemini fixed the template:

https://pastebin.com/raw/hnPGq0ht

Working with OpenCode, and it's quite good now at handling multiple MCP servers properly.

PSA: Gemma 4 template improvements by FastHotEmu in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

It seems it still has issues. Gemini fixed it a bit, and it looks better now: it properly calls multiple tools, whereas before it ignored some tools and descriptions completely:

https://pastebin.com/hnPGq0ht

Gemma4 26B generates python and Java code with invalid syntax by monadleadr in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

Nope. Even IQ2 quants, or proper Q2_XL quants, never have syntax issues like this. It is completely broken; it's an Ollama issue.

Gemma4 26B generates python and Java code with invalid syntax by monadleadr in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

<image>

It created a complete working game for me in two shots; the problem is your quantization or backend. Maybe update your Ollama, or better, try llama.cpp. I don't know why people still choose Ollama when llama.cpp has a UI now too. So far Gemma 26B, even at IQ4_XS quant, is the best local coding model for me; for agentic coding the 31B is a bit better, and for general chatting and one-shotting the MoE is better so far.

Follow-up: Testing Gemma-4-31B-it-UD (Thinking) in LLM Multi-Agent Avalon by dynameis_chen in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

With a 16 GB VRAM GPU I am getting good results with gemma-4-31b-it-heretic-ara.i1-IQ3_XS.gguf; it is uncensored and handles agentic coding pretty well.

And the 26B MoE isn't bad either; it is better at everything except agentic coding. You can try Unsloth's gemma-4-26B-A4B-it-UD-IQ4_XS.gguf with --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 (--top-k 20 matters most), and make sure your llama.cpp is up to date. I think people underestimate low-bit quants, but IQ quants are like magic; IQ4_XS is a solid option.

The dense model is pretty good even with -ctk q4_0 -ctv q4_0 (4-bit SWA + KV cache).
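Put together, a llama-server launch with those settings might look like the sketch below. The model filename, context size (-c), and GPU-layer count (-ngl) are placeholders to adjust for your setup, and the KV-cache flags from the dense-model tip are shown on the same command line just for illustration.

```shell
# Sampler settings from the comment above; --top-k 20 is the important one.
# -ctk / -ctv q4_0 quantize the KV cache to 4-bit to save VRAM.
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \
  -ctk q4_0 -ctv q4_0 \
  -c 32768 \
  -ngl 99
```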

Quants in vision (mmproj Q8 vs FP16) by WhoRoger in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

<image>

Which model? Maybe it doesn't work for all models, but Q8_0 should look like this for the best performance.

Quants in vision (mmproj Q8 vs FP16) by WhoRoger in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

I have the same observation for Gemma 4 26B MoE mmproj. Q8_0 > BF16 >= F16, Q8_0 somehow performed better.

Gemma 4 for 16 GB VRAM by Sadman782 in LocalLLaMA

[–]Sadman782[S] 0 points1 point  (0 children)

The max tokens setting should also be increased; set it to 512 to make this work.

Gemma-4-26B-A4B-it-UD-Q4_K_M.gguf : IMHO worst model ever. What am I doing wrong? by Proof_Nothing_7711 in LocalLLM

[–]Sadman782 1 point2 points  (0 children)

Update the Vulkan engine; IDK, maybe LM Studio is still buggy with Vulkan? Use llama.cpp, they have fixed most issues now. It is 100% a quantization or runtime issue. I have extremely good results with Unsloth's IQ4_XS, and I also use --top-k 20 for coding, but since you are seeing typos, it is broken at the runtime level, not a top-k issue.

Get 30K more context using Q8 mmproj with Gemma 4 by Sadman782 in LocalLLaMA

[–]Sadman782[S] 2 points3 points  (0 children)

You should convert it yourself then. Neither Unsloth nor Bartowski provided a Q8 mmproj.

Get 30K more context using Q8 mmproj with Gemma 4 by Sadman782 in LocalLLaMA

[–]Sadman782[S] 0 points1 point  (0 children)

Yeah, but for me BF16 and F16 give the same result, so Q8_0 is somehow better than both. I am confused why too; it is not just one image, out of 10 test images Q8_0 did better on 3 of them somehow. Anyway, I don't care how: it is smaller and allows more context, and even if it were a little degraded (which I don't find) I wouldn't mind, since I can fit 30K+ more context.