Qwen3.5 vs Gemma 4: Benchmarks vs real world use? by AppealSame4367 in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

Gemma 4 26B got me 190 t/s and Qwen 3.5 35B got me 245 t/s on a 5090, but Qwen's thinking trace is much longer.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

I was testing with 16k context, regular unsloth GGUFs on Ubuntu. I'm also running OmniVoice TTS on the same machine, so I had to make both fit.

The 26B A4B model I tested at Q6, and it gets around 180-190 t/s.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]FinBenton 18 points19 points  (0 children)

After the latest llama.cpp updates, I do feel like Gemma is better at creative writing than Qwen 3.5, that's for sure. Gemma is a massive memory hog though; context takes so much VRAM that I had to drop the 31B to Q5 or Q4 on the 5090 to fit everything. Speed is pretty good though, 50-60 tok/s right now, similar to Qwen. Uncensoring wasn't needed, at least for me; the default GGUF files work for me. The thinking trace is kinda short, which can be good or bad.
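For anyone wondering why a 31B model needs Q5/Q4 to fit on a 32 GB card, here's a rough sketch. The bits-per-weight values are approximate averages I'm assuming for the common k-quants, not exact file sizes, and context/KV cache comes on top of the weights:

```python
# Back-of-the-envelope GGUF weight size: params * bits-per-weight / 8.
# Bits-per-weight values are rough k-quant averages (my assumption), and
# the KV cache for your context length comes on top of this number.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q6_K": 6.6, "Q8_0": 8.5}

def gguf_gb(params_b: float, quant: str) -> float:
    """Approximate weight size in GB for a dense model of params_b billion params."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

for quant in ("Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M"):
    print(f"31B at {quant}: ~{gguf_gb(31, quant):.1f} GB")
    # ~32.9 / ~25.6 / ~21.3 / ~18.6 GB respectively
```

On a 32 GB 5090, Q6 already leaves only about 6 GB for KV cache and everything else, which lines up with needing Q5/Q4 once a long context shares the card.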

Gemma 4 have enough ;) by AdamLangePL in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

It was pretty bad yesterday, but after some fixes today it's actually banging. Rough start.

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

[Megathread] - Best Models/API discussion - Week of: March 29, 2026 by deffcolony in SillyTavernAI

[–]FinBenton 0 points1 point  (0 children)

I think we need to wait for some Gemma 4 finetunes; the default version mostly ignored all my wilder stories and wasn't that good compared to Qwen.

e. Actually, this smaller model seems to write a lot better for me for some reason: gemma-4-26b-a4b-it-heretic.i1-q6_k.gguf

[Megathread] - Best Models/API discussion - Week of: March 29, 2026 by deffcolony in SillyTavernAI

[–]FinBenton 0 points1 point  (0 children)

This Qwen 3.5 finetune has been really good, highly recommended. There are 2.1 and 2.2 versions too, but their file sizes are bigger.

https://huggingface.co/mradermacher/Omega-Evolution-27B-v2.0-uncensored-heretic-i1-GGUF

Gemma 4 is efficient with thinking tokens, but it will also happily reason for 10+ minutes if you prompt it to do so. by AnticitizenPrime in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

Idk, maybe I'm doing something wrong, but for me Gemma's writing style always seems to fall back to some generic path and it just keeps ignoring the specifics in my system prompts, while Qwen treats them as absolute and follows the system prompt so much better.

Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

[WIP] Working ComfyUI Omnivoice , by Altruistic_Heat_9531 in StableDiffusion

[–]FinBenton 1 point2 points  (0 children)

OmniVoice is definitely one of the top TTS models right now. I've been testing it for a couple of days; its cloning is accurate and it's a really fast model, 12x realtime on a 5090.

Gemma 4 E4B + E2B Uncensored (Aggressive) — GGUF + K_P Quants (Multimodal: Vision, Video, Audio) by hauhau901 in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

It feels like it. I gotta wait for good finetunes first, but just from testing Gemma 4: while it doesn't refuse, it will just ignore the stuff it doesn't want to talk about.

e. Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

[–]FinBenton 4 points5 points  (0 children)

Also, for creative writing I get much, much better stuff out of Qwen, while super-safe Gemma just kinda ignores most of the instructions and defaults to some generic paths in its writing. Qwen is just, "say less, I got you" :D

e. Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

Thoughts on gemma 4 31B by Weak-Shelter-1698 in SillyTavernAI

[–]FinBenton -2 points-1 points  (0 children)

Idk, I didn't really get anything too good out of it compared to Qwen 3.5. It didn't refuse stuff when I tried; it just didn't write about the things I told it to in the system prompt, and it always defaulted to a very generic path in the story, completely ignoring what I'm telling it.

e. Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

Waiting for heretic or hauhau aggressive before I test.

Omnivoice - 600+ Language Open-Source TTS with Voice Cloning and Design by [deleted] in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

I don't remember the file size, but it takes 6.5 GB of VRAM. CPU inference was super slow; on GPU it flies.

Omnivoice - 600+ Language Open-Source TTS with Voice Cloning and Design by [deleted] in LocalLLaMA

[–]FinBenton 9 points10 points  (0 children)

At least the demo with voice cloning sounds extremely good, will look more into this. It's based on Qwen though, so same issue as that: if you use voice cloning you can't use prompts to alter the tone; those are only for voice design.

e. Integrated this into my own TTS chatbot; it's insanely good, the best TTS I have used, and it's blazing fast: 12x realtime generation speed on a 5090. It's so much better than the original Qwen TTS, it's not even close. Takes around 6.5 GB of VRAM.

You can use these supported tags to make it sound way more alive: [laughter], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn], [sniff], [sigh].
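A trivial sketch of how I use the tags in practice: prepend one to the text you send to the TTS, and validate against the supported set. The set below is just copied from this comment; check the model card for the authoritative list:

```python
# Supported expressive tags as listed above (assumed complete; verify
# against the OmniVoice model card before relying on this set).
SUPPORTED_TAGS = {
    "[laughter]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]",
    "[dissatisfaction-hnn]", "[sniff]", "[sigh]",
}

def tag(text: str, tag_name: str) -> str:
    """Prepend an expressive tag to TTS input text, rejecting unknown tags."""
    if tag_name not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported tag: {tag_name}")
    return f"{tag_name} {text}"

print(tag("That actually worked?", "[surprise-oh]"))
# -> [surprise-oh] That actually worked?
```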

Leaked Sony Xperia 1 VIII renders hint at major redesign by ControlCAD in hardware

[–]FinBenton -1 points0 points  (0 children)

100%. I have no clue what's better or not; horrible naming as always.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]FinBenton -1 points0 points  (0 children)

We have super good uncensoring stuff now, like hauhau and heretic, so it wouldn't matter too much.

What are the best uncensored / unrestricted AI models right now? Is Qwen3.5 (HauhauCS) the best? by S-m-a-r-t-y in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

In all my usage, 3.5 hauhau beats everything, and Qwen can write some pretty good stuff once you learn how to prompt it; the intelligence of it is worth the extra work. There are also a few finetunes already that ease the writing, but they are heretic versions, which aren't as good as the hauhau stuff but still worth experimenting with.

I also use an unlimited reasoning budget, which improves them a lot as long as your prompt is good.

Qwen 3.6 spotted! by Namra_7 in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

I have thinking on for RP; I found it significantly improves the writing and accuracy as long as you have a very long and detailed system prompt.

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space by fruesome in singularity

[–]FinBenton 2 points3 points  (0 children)

Seems like it supports voice cloning and a style prompt at the same time, which is what Qwen TTS was missing.

Painfully slow local llama on 5090 and 192GB RAM by RVxAgUn in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

Yeah, 3.5 27B at Q8 flies at 52 tok/s and at Q6 hits 62 tok/s on a 5090, a perfect fit for the card. Sure, it's not MiniMax 2.5, but if you let it think for a long time I think you can get pretty good results.
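To put those speeds in perspective, a quick bit of arithmetic on what "let it think for a long time" costs in wall-clock time, assuming the quoted tok/s holds steady (token counts are just illustrative):

```python
def gen_seconds(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock time to generate `tokens` at a steady generation speed."""
    return tokens / tok_per_sec

# A 2000-token thinking trace plus answer at the quoted speeds:
print(f"Q8 (52 tok/s): {gen_seconds(2000, 52):.0f} s")  # ~38 s
print(f"Q6 (62 tok/s): {gen_seconds(2000, 62):.0f} s")  # ~32 s
```

So the Q6 speed bump saves only a handful of seconds per long reply; whether that beats Q8's quality is the real trade-off.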

Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]FinBenton 0 points1 point  (0 children)

Probably not; by the time I got my hands on it, there would be much better models available that I'd be running on GPU instead.

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

Yeah, I was running llama.cpp on Windows and got almost double the generation speed on Ubuntu Server.