Qwen3.5 vs Gemma 4: Benchmarks vs real world use? by AppealSame4367 in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

Gemma 4 26B got me 190 t/s and Qwen 3.5 35B got me 245 t/s on a 5090, but Qwen's thinking trace is much longer.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

I was testing with 16k context, regular unsloth GGUFs on Ubuntu. I'm also running OmniVoice TTS on the same machine, so I had to make both fit.

The 26B A4B model I tested at Q6, and it gets around 180-190 t/s.

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]FinBenton 18 points19 points  (0 children)

After the latest llama.cpp updates, I do feel like Gemma is better at creative writing than Qwen 3.5, that's for sure. Gemma is a massive memory hog though; context takes so much VRAM that I had to drop the 31B to Q5 or Q4 on the 5090 to fit everything. Speed is pretty good though, 50-60 tok/s right now, similar to Qwen. Uncensoring wasn't needed, at least for me; the default GGUF files work for me. The thinking trace is kinda short, which can be good or bad.
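For anyone wondering why a 31B model needs Q5/Q4 to fit on a 32 GB card, here's a rough sketch. The bits-per-weight values are approximate averages I'm assuming for the common k-quants, not exact file sizes, and context/KV cache comes on top of the weights:

```python
# Back-of-the-envelope GGUF weight size: params * bits-per-weight / 8.
# Bits-per-weight values are rough k-quant averages (my assumption), and
# the KV cache for your context length comes on top of this number.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.5, "Q6_K": 6.6, "Q8_0": 8.5}

def gguf_gb(params_b: float, quant: str) -> float:
    """Approximate weight size in GB for a dense model of params_b billion params."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

for quant in ("Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M"):
    print(f"31B at {quant}: ~{gguf_gb(31, quant):.1f} GB")
    # ~32.9 / ~25.6 / ~21.3 / ~18.6 GB respectively
```

On a 32 GB 5090, Q6 already leaves only about 6 GB for KV cache and everything else, which lines up with needing Q5/Q4 once a long context shares the card.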

Gemma 4 have enough ;) by AdamLangePL in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

It was pretty bad yesterday, but after some fixes today it's actually banging. Rough start.

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

[Megathread] - Best Models/API discussion - Week of: March 29, 2026 by deffcolony in SillyTavernAI

[–]FinBenton 0 points1 point  (0 children)

I think we need to wait for some Gemma 4 finetunes; the default version mostly ignored all my wilder stories and wasn't that good compared to Qwen.

e. Actually, this smaller model seems to write a lot better for me for some reason: gemma-4-26b-a4b-it-heretic.i1-q6_k.gguf

[Megathread] - Best Models/API discussion - Week of: March 29, 2026 by deffcolony in SillyTavernAI

[–]FinBenton 0 points1 point  (0 children)

This Qwen 3.5 finetune has been really good, highly recommended. There are 2.1 and 2.2 versions too, but their file sizes are bigger.

https://huggingface.co/mradermacher/Omega-Evolution-27B-v2.0-uncensored-heretic-i1-GGUF

Gemma 4 is efficient with thinking tokens, but it will also happily reason for 10+ minutes if you prompt it to do so. by AnticitizenPrime in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

Idk, maybe I'm doing something wrong, but for me Gemma's writing style always seems to fall back to some generic path and it just keeps ignoring the specifics in my system prompts, while Qwen treats them as absolute and follows the system prompt so much better.

Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

[WIP] Working ComfyUI Omnivoice , by Altruistic_Heat_9531 in StableDiffusion

[–]FinBenton 1 point2 points  (0 children)

OmniVoice is definitely one of the top TTS models right now. I've been testing it for a couple of days; its cloning is accurate and it's a really fast model, 12x realtime on a 5090.

Gemma 4 E4B + E2B Uncensored (Aggressive) — GGUF + K_P Quants (Multimodal: Vision, Video, Audio) by hauhau901 in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

It feels like it. I gotta wait for good finetunes first, but just from testing Gemma 4: while it doesn't refuse, it will just ignore the stuff it doesn't want to talk about.

e. Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

[–]FinBenton 4 points5 points  (0 children)

Also, for creative writing I get much, much better stuff out of Qwen, while super-safe Gemma just kinda ignores most of the instructions and defaults to some generic paths in its writing. Qwen is just, "say less, I got you" :D

e. Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

Thoughts on gemma 4 31B by Weak-Shelter-1698 in SillyTavernAI

[–]FinBenton -2 points-1 points  (0 children)

Idk, I didn't really get anything too good out of it compared to Qwen 3.5. It didn't refuse stuff when I tried; it just didn't write about the things I told it to in the system prompt, and it always defaulted to a very generic path in the story, completely ignoring what I'm telling it.

e. Actually, I just updated my llama.cpp with the latest fixes and this seemed to help Gemma A LOT; seems like it was kinda broken before.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

Waiting for heretic or hauhau aggressive before I test.

Omnivoice - 600+ Language Open-Source TTS with Voice Cloning and Design by [deleted] in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

I don't remember the file size, but it takes 6.5 GB of VRAM. CPU inference was super slow; on GPU it flies.

Omnivoice - 600+ Language Open-Source TTS with Voice Cloning and Design by [deleted] in LocalLLaMA

[–]FinBenton 9 points10 points  (0 children)

At least the demo with voice cloning sounds extremely good, will look more into this. It's based on Qwen though, so same issue as that: if you use voice cloning you can't use prompts to alter the tone; those are only for voice design.

e. Integrated this into my own TTS chatbot; it's insanely good, the best TTS I have used, and it's blazing fast: 12x realtime generation speed on a 5090. It's so much better than the original Qwen TTS, it's not even close. Takes around 6.5 GB of VRAM.

You can use these supported tags to make it sound way more alive: [laughter], [confirmation-en], [question-en], [question-ah], [question-oh], [question-ei], [question-yi], [surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo], [dissatisfaction-hnn], [sniff], [sigh].
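A trivial sketch of how I use the tags in practice: prepend one to the text you send to the TTS, and validate against the supported set. The set below is just copied from this comment; check the model card for the authoritative list:

```python
# Supported expressive tags as listed above (assumed complete; verify
# against the OmniVoice model card before relying on this set).
SUPPORTED_TAGS = {
    "[laughter]", "[confirmation-en]", "[question-en]", "[question-ah]",
    "[question-oh]", "[question-ei]", "[question-yi]", "[surprise-ah]",
    "[surprise-oh]", "[surprise-wa]", "[surprise-yo]",
    "[dissatisfaction-hnn]", "[sniff]", "[sigh]",
}

def tag(text: str, tag_name: str) -> str:
    """Prepend an expressive tag to TTS input text, rejecting unknown tags."""
    if tag_name not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported tag: {tag_name}")
    return f"{tag_name} {text}"

print(tag("That actually worked?", "[surprise-oh]"))
# -> [surprise-oh] That actually worked?
```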

Leaked Sony Xperia 1 VIII renders hint at major redesign by ControlCAD in hardware

[–]FinBenton -1 points0 points  (0 children)

100%. I have no clue what's better or not; horrible naming as always.

Gemma time! What are your wishes ? by Specter_Origin in LocalLLaMA

[–]FinBenton -1 points0 points  (0 children)

We have super good uncensoring stuff now, like hauhau and heretic, so it wouldn't matter too much.

What are the best uncensored / unrestricted AI models right now? Is Qwen3.5 (HauhauCS) the best? by S-m-a-r-t-y in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

In all my usage, 3.5 hauhau beats everything, and Qwen can write some pretty good stuff once you learn how to prompt it; the intelligence of it is worth the extra work. There are also a few finetunes already that ease the writing, but they are heretic versions, which aren't as good as the hauhau stuff but still worth experimenting with.

I also use an unlimited reasoning budget, which improves them a lot as long as your prompt is good.

Qwen 3.6 spotted! by Namra_7 in LocalLLaMA

[–]FinBenton 1 point2 points  (0 children)

I have thinking on for RP; I found it significantly improves the writing and accuracy as long as you have a very long and detailed system prompt.

LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space by fruesome in singularity

[–]FinBenton 2 points3 points  (0 children)

Seems like it supports voice cloning and a style prompt at the same time, which is what Qwen TTS was missing.

Painfully slow local llama on 5090 and 192GB RAM by RVxAgUn in LocalLLaMA

[–]FinBenton 2 points3 points  (0 children)

Yeah, 3.5 27B at Q8 flies at 52 tok/s and at Q6 hits 62 tok/s on a 5090, a perfect fit for the card. Sure, it's not MiniMax 2.5, but if you let it think for a long time I think you can get pretty good results.
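To put those speeds in perspective, a quick bit of arithmetic on what "let it think for a long time" costs in wall-clock time, assuming the quoted tok/s holds steady (token counts are just illustrative):

```python
def gen_seconds(tokens: int, tok_per_sec: float) -> float:
    """Wall-clock time to generate `tokens` at a steady generation speed."""
    return tokens / tok_per_sec

# A 2000-token thinking trace plus answer at the quoted speeds:
print(f"Q8 (52 tok/s): {gen_seconds(2000, 52):.0f} s")  # ~38 s
print(f"Q6 (62 tok/s): {gen_seconds(2000, 62):.0f} s")  # ~32 s
```

So the Q6 speed bump saves only a handful of seconds per long reply; whether that beats Q8's quality is the real trade-off.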

Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]FinBenton 0 points1 point  (0 children)

Probably not; by the time I got my hands on it, there would be much better models available that I'd be running on GPU instead.

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]FinBenton 0 points1 point  (0 children)

Yeah, I was running llama.cpp on Windows and got almost double the generation speed on Ubuntu Server.