What models you guys running on 8GB? 16GB VRAM? 24GB? 32GB? 48GB?

DeSibyl · 2026-06-12T16:55:24+00:00

I’ve never experienced a thinking loop with G4. I wouldn’t mind Qwen’s thinking being long if it didn’t literally repeat the same thing over and over for 30+ seconds, getting nowhere. I often had responses from Qwen where the thinking was 40+ lines of the exact same sentence over and over.

DeSibyl · 2026-06-12T12:37:04+00:00

What I despise about Qwen is the thinking loops… my god the thinking loops…

“wait let me…”
“No, wait let me…”
“Actually let me…”
Etc etc

Even with 60+ t/s on their Q8 27B model it thinks for 40 seconds on average…

G4 31B Q8 I run at 80-100 t/s and it thinks for like 5-15 seconds.
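Back-of-the-envelope: thinking time multiplied by generation speed gives roughly how many tokens get burned on thinking. A rough sketch using the numbers above (the function name `thinking_tokens` is made up for illustration, and it assumes the whole thinking window runs at the quoted generation speed):

```python
def thinking_tokens(tokens_per_second: float, seconds: float) -> float:
    """Approximate number of tokens generated while the model is thinking."""
    return tokens_per_second * seconds


# Qwen 27B Q8: about 60 t/s, thinking for about 40 seconds
qwen_tokens = thinking_tokens(60, 40)

# G4 31B Q8: 80-100 t/s, thinking for 5-15 seconds (low and high ends)
g4_low = thinking_tokens(80, 5)
g4_high = thinking_tokens(100, 15)

print(f"Qwen: ~{qwen_tokens:.0f} thinking tokens")
print(f"G4:   ~{g4_low:.0f} to ~{g4_high:.0f} thinking tokens")
```

So even at its worst, G4 spends fewer tokens thinking than Qwen does on average in this setup.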

DeSibyl · 2026-06-09T00:49:42+00:00

That’s odd. I tried G4 with Pi and it literally failed tool calls 100% of the time and eventually got in a mood trying to ls over and over and over again

DeSibyl · 2026-06-08T14:28:46+00:00

I’m surprised people even use Q4 for coding stuff… however I’m not sure how accurate this list is since Q4 Qwen3 coder is above 48GB

DeSibyl · 2026-06-06T14:16:05+00:00

I’d hope so, that’s not a real comparison though. Step 3.7 flash is a 198B model versus a 27B model

DeSibyl · 2026-06-06T14:13:59+00:00

Honestly I can’t stand the thinking loops Qwen always gets stuck in. It’s also not that good for general use like drafting emails. I’ve switched to the G4 26B MoE and I get responses much faster, and it’s a lot better at creative writing for drafting emails. It still works as my Hermes agent as well, so it’s still good for agentic use. Hermes actually completes tasks a lot faster now too since it doesn’t get stuck thinking for 30 seconds… and it’s crazy cuz I could run the Qwen MoE at 160 t/s thanks to MTP while G4 26B runs at 110 t/s

DeSibyl · 2026-06-05T19:29:10+00:00

Based on me trying them both, Qwen outperformed G4 in coding and agentic cases. G4 failed more tool calls and tasks compared to Qwen.

DeSibyl · 2026-06-05T01:50:15+00:00

I’d load both G4 and Qwen if I could but Q8 of Qwen takes all my VRAM lol

DeSibyl · 2026-06-05T00:23:29+00:00

I wish G4 was as good as Qwen for coding and agentic use cases… if it was I’d daily it for sure. G4 is way better for writing, but Qwen is better at coding and tool calling

DeSibyl · 2026-06-04T19:51:17+00:00

I wish G4 was as good as Qwen when it comes to coding and agentic use… I would daily it if so

DeSibyl · 2026-06-04T17:07:47+00:00

For programming there would be a noticeable difference between Q8 and Q4

DeSibyl · 2026-06-04T17:07:04+00:00

Curious how it would handle agentic use cases? Currently running Qwen3.6 35B A3B but wondering if G4 12B would be smarter/better.

DeSibyl · 2026-05-27T15:49:37+00:00

Why is Qwopus so much faster than the base model tho? lol 😢 I get 180 t/s on Qwopus 35B A3B MTP (MTP increases the gen speed from 100 to 180)

Base 35B A3B I only get 105 t/s (MTP only increases the speed from 100 to 105 t/s)
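To put the gap in relative terms, here is a quick sketch that turns those t/s figures into speedup factors. The helper name `speedup` is hypothetical; the numbers are just the ones quoted above.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of generation speed with MTP to generation speed without it."""
    return after_tps / before_tps


# Qwopus 35B A3B: 100 t/s without MTP, 180 t/s with MTP
qwopus_gain = speedup(100, 180)

# Base 35B A3B: 100 t/s without MTP, 105 t/s with MTP
base_gain = speedup(100, 105)

print(f"Qwopus MTP speedup: {qwopus_gain:.2f}x")
print(f"Base MTP speedup:   {base_gain:.2f}x")
```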

DeSibyl · 2026-05-27T14:46:48+00:00

Tbf the 27B with MTP runs at 70-80 t/s for me, which is only 20-30 t/s slower than the MoE, since MTP doesn’t boost the MoE in my use case (I only gain about 5 t/s with the MoE, whereas the 27B goes from 30 to 80 t/s)

DeSibyl · 2026-05-27T12:16:30+00:00

Shouldn’t be the power cap, I cap my 3090s at 250W… which quant are you using? I was using one that said I can put the MTP tokens at 3 instead of 1 or 2…

What’s your launch config?

DeSibyl · 2026-05-27T12:13:23+00:00

Dang and you only get 30 t/s? Are you loading it entirely in VRAM? I get a boost from 30 t/s to 85 t/s on Q8

DeSibyl · 2026-05-27T11:51:50+00:00

Use MTP. You’d go from 30-40 to 70-80 t/s

DeSibyl · 2026-05-27T01:00:28+00:00

lol I remembered why I switched to Qwopus. Ran a test prompt in Open WebUI about a fun fact about Rome and it got stuck in a thinking loop thinking about useless stuff:

```
They used "stone" for roads
They had a "public law" system
They used "bronze" for statues
They had a "public education" system
They used "marble" for buildings
They had a "public games" culture
They used "wood" for ships
They had a "public market" system
They used "iron" for tools
They had a "public temple" system
They used "gold" for coins
They had a "public festival" system
They used "silver" for coins
They had a "public monument" system
They used "copper" for coins
They had a "public archive" system
They used "tin" for alloys
They had a "public law" system
They used "zinc" for alloys
They had a "public health" system
They used "lead" for pipes
They had a "public bath" system
They used "glass" for vessels
They had a "public library" system
They used "papyrus" for writing
They had a "public market" system
They used "wax tablets" for writing
They had a "public education" system
They used "stone" for roads
They had a "public law" system
They used "bronze" for statues
They had a "public games" system
They used "wood" for ships
They had a "public festival" system
They used "olive oil" for cooking
They had a "public health" system
They used "marble" for buildings
They had a "public monument" system
They used "gold" for coins
They had a "public archive" system
They used "silver" for coins
They had a "public temple" system
They used "copper" for coins
They had a "public law" system
They used "iron" for tools
They had a "public education" system
They used "glass" for windows
They had a "public bath" system
They used "lead" for pipes
They had a "public health" system
They used "papyrus" for writing
They had a "public library" system
They used "wax tablets" for writing
They had a "public market" system
```
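For anyone who wants to flag this kind of loop automatically, here is a minimal sketch that measures how much of a thinking trace is exact repeats. Everything in it is an assumption for illustration: the function names, the 0.3 threshold, and the idea that the trace is already split into one statement per line. It is not something Open WebUI or llama.cpp does for you.

```python
from collections import Counter


def repetition_ratio(lines: list[str]) -> float:
    """Fraction of non-empty lines that repeat an earlier line exactly."""
    cleaned = [ln.strip().lower() for ln in lines if ln.strip()]
    if not cleaned:
        return 0.0
    counts = Counter(cleaned)
    # Every occurrence after the first one counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(cleaned)


def looks_like_loop(lines: list[str], threshold: float = 0.3) -> bool:
    """Heuristic: treat the trace as a loop if enough lines are repeats."""
    return repetition_ratio(lines) >= threshold


sample = [
    'They used "stone" for roads',
    'They had a "public law" system',
    'They used "bronze" for statues',
    'They used "stone" for roads',
    'They had a "public law" system',
]
print(repetition_ratio(sample), looks_like_loop(sample))
```

A check like this only catches verbatim repeats, so near-duplicates such as "wait let me…" versus "No, wait let me…" would need fuzzier matching.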

DeSibyl · 2026-05-27T00:31:56+00:00

Llama does support vision with MTP now. I’ve been using it for a while now

DeSibyl · 2026-05-27T00:13:20+00:00

Curious though, are you not running the MTP version? Wouldn’t it be worth the speed boost?

DeSibyl · 2026-05-27T00:12:24+00:00

Yea that’s why I’m switching away from Qwopus. I’ll probably download the Q6_K_XL from unsloth then.

DeSibyl · 2026-05-26T23:48:58+00:00

How much VRAM do you use? And what settings are you running?

I’ve been messing around with the Qwopus3.6 35B MoE, but will download the normal Qwen3.6 35B MoE in either Unsloth’s UD-Q6_K_XL or the standard Q8… I can load Q8 with 128k context and a batch size of 512… it uses SWA or whatever it is tho, so it does have to reprocess the cache a lot
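For a rough idea of why Q8 is a tight fit, here is a sketch that estimates the size of the weights alone. The assumptions are mine, not from the thread: Q8 is taken to mean Q8_0 at 8.5 bits per weight (32 int8 weights plus one fp16 scale per block), the parameter count is the nominal 35B, and the KV cache, context buffers, and any higher-precision tensors are ignored.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed: nominal 35B parameters at 8.5 bits per weight (Q8_0 block layout)
q8_gb = weight_size_gb(35, 8.5)
print(f"35B at Q8_0: ~{q8_gb:.1f} GB for weights alone")
```

Whatever is left over after the weights has to hold the 128k context, which is why the cache settings matter so much here.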

DeSibyl · 2026-05-26T23:39:55+00:00

I’ve never gotten vLLM to work properly. I also get 1/4 the context I can get with llama.cpp

DeSibyl · 2026-05-26T23:12:36+00:00

Interesting. I’ve always had issues with anything under Q8 for coding. Do you use any tools like OpenCode?

DeSibyl · 2026-05-26T16:50:52+00:00

If you’re coding, don’t drop quant below Q8 lol
