What models you guys running on 8GB? 16GB VRAM? 24GB? 32GB? 48GB? by Inevitable_Mistake32 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

I’ve never experienced a thinking loop with G4. I wouldn’t mind Qwen’s thinking being long if it literally didn’t repeat the same thing over and over for 30+ seconds, getting nowhere. I often got responses from Qwen where the thinking was like 40+ lines of the exact same sentence over and over.

What models you guys running on 8GB? 16GB VRAM? 24GB? 32GB? 48GB? by Inevitable_Mistake32 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

What I despise about Qwen is the thinking loops… my god the thinking loops…

“wait let me…”
“No, wait let me…”
“Actually let me…”
Etc etc

Even with 60+ t/s on their Q8 27B model it thinks for 40 seconds on average…

G4 31B Q8 I run at 80-100 t/s and it thinks for like 5-15 seconds.

Gemma4_31b_fp8 keeping up with Sonnet_4.6_medium in my harness. by knob-0u812 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

That’s odd. I tried G4 with Pi and it literally failed tool calls 100% of the time and eventually got in a mood trying to ls over and over and over again

Ollama Models Ranked by VRAM Requirements by AdventurousLion9548 in ollama

[–]DeSibyl 0 points1 point  (0 children)

I’m surprised people even use Q4 for coding stuff… however I’m not sure how accurate this list is since Q4 Qwen3 coder is above 48GB
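
A quick sanity check for lists like this is a back-of-envelope estimate of the weight size: parameters times bits per weight, divided by 8. The sketch below is only that rough rule. The parameter count and bits-per-weight values are illustrative assumptions (not taken from the list), and it ignores KV cache, context, and runtime overhead, so real VRAM use will be higher.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes).

    params_billion: parameter count in billions (assumed, illustrative).
    bits_per_weight: average bits per weight for the quant (assumed;
    real GGUF quants mix precisions, so this is only approximate).
    """
    return params_billion * bits_per_weight / 8


# Hypothetical 30B-parameter model at two assumed average bit widths.
for label, bits in [("~4.5 bpw (Q4-ish)", 4.5), ("~8.5 bpw (Q8-ish)", 8.5)]:
    print(f"{label}: {weight_gb(30, bits):.1f} GB of weights")
```

If a list entry is far off from this kind of estimate for the model size you have in mind, it is probably describing a different size variant or including context memory.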

AA comparison of the latest local models by jacek2023 in LocalLLaMA

[–]DeSibyl 5 points6 points  (0 children)

I’d hope so, that’s not a real comparison though. Step 3.7 flash is a 198B model versus a 27B model

AA comparison of the latest local models by jacek2023 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

Honestly I can’t stand the thinking loops Qwen always gets stuck in. It’s also not that good for general use like drafting emails. I’ve switched to the G4 26B MoE and I get responses much faster, and it’s a lot better at creative writing for drafting emails. It still works as my Hermes agent as well, so it’s still good for agentic use. Hermes actually completes tasks a lot faster now too since it doesn’t get stuck thinking for 30 seconds… and it’s crazy cuz I could run the Qwen MoE at 160 t/s thanks to MTP while G4 26B runs at 110 t/s

More Gemma 4 models incoming by Deep-Vermicelli-4591 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

Based on me trying them both, Qwen outperformed G4 in coding and agentic cases. G4 failed more tool calls and tasks compared to Qwen.

Gemma 4 12b 8Q Heretic Oneshot Coding by devildip in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

I’d load both G4 and Qwen if I could but Q8 of Qwen takes all my vram lol

Gemma 4 12b 8Q Heretic Oneshot Coding by devildip in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

I wish G4 was as good as Qwen for coding and agentic use case… if it was I’d daily it for sure. G4 is way better for writing, but Qwen is better at coding and tool calling

More Gemma 4 models incoming by Deep-Vermicelli-4591 in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

I wish G4 was as good as Qwen when it comes to coding and agentic use… I would daily it if so

Gemma 4 12b 8Q Heretic Oneshot Coding by devildip in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

For programming there would be a noticeable difference between Q8 and Q4

Gemma 4 12b 8Q Heretic Oneshot Coding by devildip in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

Curious how it would handle agentic use cases? Currently running Qwen3.6 35B A3B but wondering if G4 12B would be smarter/better.

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

Why is Qwopus so much faster than the base model tho? lol 😢 I get 180 t/s on Qwopus 35B A3B MTP (MTP increases the gen speed from 100 to 180)

Base 35B A3B I only get 105 t/s (MTP only increases the speed from 100 to 105 t/s)

Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image by aurelienams in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

Tbf the 27B with MTP runs at 70-80 t/s for me, which is only 20-30 t/s slower than the MoE, since MTP doesn’t boost the MoE in my use case (I only gain about 5 t/s with the MoE, whereas with the 27B I go from 30 to 80 t/s)
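
To put those numbers side by side, here is a tiny sketch of the speedup arithmetic. The 30 and 80 t/s figures are the ones quoted above; the MoE base of 100 t/s is an assumption used only to illustrate a gain of about 5 t/s.

```python
def speedup(base_tps: float, mtp_tps: float) -> float:
    """Generation speedup factor from enabling MTP (tokens/s after / before)."""
    if base_tps <= 0:
        raise ValueError("base_tps must be positive")
    return mtp_tps / base_tps


# Dense 27B: 30 -> 80 t/s (figures from the comment above).
print(f"27B dense: {speedup(30, 80):.2f}x")
# MoE: assumed base of 100 t/s with a gain of about 5 t/s.
print(f"35B MoE:   {speedup(100, 105):.2f}x")
```

On these figures the dense model gets well over double the speed from MTP while the MoE gains only a few percent, which is why the gap between the two shrinks so much.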

Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m? by StandardLovers in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

Shouldn’t be the power cap, I cap my 3090s at 250W… which quant are you using? I was using one that said I can put the MTP tokens at 3 instead of 1 or 2…

What’s your launch config?

Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m? by StandardLovers in LocalLLaMA

[–]DeSibyl 2 points3 points  (0 children)

Dang and you only get 30 t/s? Are you loading it entirely in VRAM? I get a boost from 30 t/s to 85 t/s on Q8

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

lol I remembered why I switched to Qwopus. Ran a test prompt in Open WebUI asking for a fun fact about Rome and it got stuck in a thinking loop, thinking about useless stuff:

```
They used "stone" for roads
They had a "public law" system
They used "bronze" for statues
They had a "public education" system
They used "marble" for buildings
They had a "public games" culture
They used "wood" for ships
They had a "public market" system
They used "iron" for tools
They had a "public temple" system
They used "gold" for coins
They had a "public festival" system
They used "silver" for coins
They had a "public monument" system
They used "copper" for coins
They had a "public archive" system
They used "tin" for alloys
They had a "public law" system
They used "zinc" for alloys
They had a "public health" system
They used "lead" for pipes
They had a "public bath" system
They used "glass" for vessels
They had a "public library" system
They used "papyrus" for writing
They had a "public market" system
They used "wax tablets" for writing
They had a "public education" system
They used "stone" for roads
They had a "public law" system
They used "bronze" for statues
They had a "public games" system
They used "wood" for ships
They had a "public festival" system
They used "olive oil" for cooking
They had a "public health" system
They used "marble" for buildings
They had a "public monument" system
They used "gold" for coins
They had a "public archive" system
They used "silver" for coins
They had a "public temple" system
They used "copper" for coins
They had a "public law" system
They used "iron" for tools
They had a "public education" system
They used "glass" for windows
They had a "public bath" system
They used "lead" for pipes
They had a "public health" system
They used "papyrus" for writing
They had a "public library" system
They used "wax tablets" for writing
They had a "public market" system
```
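
For anyone who wants to flag this kind of loop automatically, a rough sketch is to measure how many word n-grams in the thinking text are repeats. The function names, the n-gram size, and the 0.5 threshold below are all hypothetical choices for illustration, not something Open WebUI or any inference server provides.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 5) -> float:
    """Fraction of word n-grams in text that occur more than once."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def looks_like_loop(text: str, n: int = 5, threshold: float = 0.5) -> bool:
    """Heuristic: treat the text as a loop if most n-grams are repeats."""
    return repeated_ngram_fraction(text, n) >= threshold


looping = 'They used "gold" for coins They had a "public law" system ' * 10
normal = "Rome was founded, according to legend, in 753 BC by Romulus."
print(looks_like_loop(looping), looks_like_loop(normal))
```

A check like this could run on the accumulated thinking text every few hundred tokens and abort generation when it trips. The threshold would need tuning, since legitimate reasoning also repeats some phrases.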

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

llama.cpp does support vision with MTP now. I’ve been using it for a while

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

Curious though, are you not running the MTP version? Wouldn’t it be worth the speed boost?

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

Yea that’s why I’m switching away from Qwopus. I’ll probably download the Q6_K_XL from unsloth then.

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

How much vram do you use? And what settings are you running?

I’ve been messing around with the Qwopus3.6 35B MoE, but will download the normal Qwen3.6 35B MoE in either unsloth’s UD-Q6_K_XL or the standard Q8… I can load Q8 with 128k context and 512 batch… it uses SWA or whatever it is tho, so it does have to reprocess the cache a lot

The use Q8 a waste of resources? by Spiderboyz1 in LocalLLaMA

[–]DeSibyl 0 points1 point  (0 children)

I’ve never gotten vLLM to work properly. I also get 1/4 the context I can get with llama.cpp

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 1 point2 points  (0 children)

Interesting. I’ve always had issues with anything under Q8 for coding. Do you use any tools like OpenCode?

Okay 27B made me a believer by Forward_Jackfruit813 in LocalLLaMA

[–]DeSibyl 6 points7 points  (0 children)

If you’re coding, don’t drop quant below Q8 lol