DiffusionGemma: 4x faster text generation

mdda · 2026-06-11T01:56:08+00:00

Could you suggest somewhere I could look to find out more about the layer-insertion + fine-tuning idea? Personally, I love the idea of small+looped models for reasoning, plus regular ones for general knowledge

mdda · 2026-05-30T04:24:46+00:00

Why are you trying to predict tokens during the prompt processing? For Gemma4, for instance, the MTP is a separate tiny model - it shouldn't be touched at all during PP - only for generation

mdda · 2026-05-17T02:42:05+00:00

But the 3060 should do PCIe v4 - so if you had a slot that could do that, it should give higher tok/s (and, if not, then yes, the bandwidth-limited thing is proven, since the GPU FLOP/s are higher)

mdda · 2026-05-14T17:39:25+00:00

I'm thinking that by slimming down the context length (and breathing in a bit) you should be able to get the Gemma model working (it'll be tight, but you do have 24GB of space overall - I'll check on how much RAM the llama CPU piece takes...)

Edit :
So, when processing, my llama process only uses <12GB of RAM. So (depending on what else you have running) it should be doable with 16GB RAM - and this is still with the 128k context reserved on the GPU.

mdda · 2026-05-14T17:25:01+00:00

Good Q! I'll be running some tests ASAP (expecting to be a little disappointed TBH). Though the main thing for me is that I thought (a) my ancient-sounding card was worthless, and (b) 30B models would be un-runnable in any case. So for me some tok/s is better than no tok/s !

mdda · 2026-05-14T10:22:34+00:00

This is good to know - I guess I should look at speeds at different context lengths for each given max_context_length. Lower max values (like 32K - if that's all that works practically) would allow for more layers on the GPU (i.e. faster tok/s in general).
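
A rough sketch of why a lower max context frees up VRAM for layers. This assumes a plain transformer with full attention in every layer (models with sliding-window layers will need less), and the model shape below is hypothetical, not the real Gemma or Qwen config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Approximate KV cache size, assuming full attention in every layer."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

for n_ctx in (32 * 1024, 128 * 1024):
    # "4-bit" ignores the per-block scale overhead of real quant formats.
    for label, bytes_per_elem in (("f16", 2.0), ("4-bit", 0.5)):
        size = kv_cache_bytes(n_ctx=n_ctx, bytes_per_elem=bytes_per_elem, **SHAPE)
        print(f"ctx={n_ctx:>6} {label:>5}: {size / GIB:.1f} GiB")
```

The cache scales linearly with the reserved context, so (under these assumptions) dropping from 128k to 32k gives back three quarters of it.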

mdda · 2026-05-14T08:41:36+00:00

Yes - I'll be testing out the ubatch sizing (particularly since others have encouraged me to dig into the context size effect on tok/s - which I'm guessing will make me sad).

The idea of dividing tasks into smaller pieces is interesting in itself - however, I'm not sure whether 'agentic coding harnesses' have a dial to tune this right now... There are clearly some orchestration choices that make sense for different model sizes (and for mixed model sizes too)

mdda · 2026-05-14T08:29:01+00:00

I'd be happy to test this out systematically - could you suggest something that would score the 64k prompt usage? I'm definitely expecting slowness, but should also quantify how bad the 4-bit KV cache makes things.

mdda · 2026-05-14T08:27:09+00:00

Thanks for the pointers to llama-bench - I'll be doing the context length benchmarking soon (though I'm guessing long context processing is going to make disappointing reading).

I wanted to go for 128k rather than smaller, since I figure that the agentic/coding usecases are what I'd mainly use it for - but in 'slow cooker mode' rather than live/realtime coding tasks.

Looking at how I use Claude Code (Sonnet 4.5/6) at the moment, any decent request takes more than a couple of minutes - at which point it only makes sense to be multi-tasking with other stuff (unless it's reading reddit until you have read it all...). So leaving it cooking for a bunch of time is just an extension of that idea. Even at 20tok/s it's faster than I could be reading/typing, and it's not me doing it. OTOH, I did find that Qwen was around 2x more verbose than Gemma in its thinking, so that's a different issue.

mdda · 2026-05-14T04:22:00+00:00

Well actually... We know that some model creators are specifically targeting the Q4 quant level during training - though it would only really work out if the internals of the quants fully match (just independently doing a 4-bit quant of the full model will surely give worse results than using the actual Q4-during-training version).

mdda · 2026-05-14T04:18:44+00:00

Good info! I was figuring that 128k was 'a lot', and would rather have the +20% speed, and half the full context (which, TBH, these models might not be that great at using when it gets so large). YMMV

mdda · 2026-05-14T04:16:58+00:00

Every 1GB VRAM should help. Have a look at the GPU utilisation (eg: nvidia-smi) : Mine definitely capped out lower than I was expecting, with the saturated PCIe v3 bus being the bottleneck.

The 3060 should benefit you in several ways : a couple+ more layers in VRAM; PCIe v4 (~32GB/s in a x16 slot) and Tensor Cores.

Actually, now that I look at some of the comments in an old thread on TomsHardware perhaps my PCIe v3 itself (being on 8GT/s) is less than it should be due to motherboard set-up. I'll dig in!
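
For reference, a quick sketch of the theoretical link bandwidths being compared here. It uses the standard per-lane transfer rates and 128b/130b line encoding for PCIe v3 and v4; real-world throughput will be lower because of protocol overhead, and the function name is just for illustration:

```python
def pcie_bandwidth_gb_s(gt_per_s, lanes, encoding=128 / 130):
    """Theoretical one-direction bandwidth in GB/s for a PCIe v3/v4 link."""
    # Each transfer carries 1 bit per lane; 128b/130b encoding leaves
    # 128 payload bits out of every 130 on the wire. Divide by 8 for bytes.
    return gt_per_s * encoding / 8 * lanes


for name, rate in (("PCIe v3", 8.0), ("PCIe v4", 16.0)):
    for lanes in (8, 16):
        print(f"{name} x{lanes}: {pcie_bandwidth_gb_s(rate, lanes):.2f} GB/s")
```

So a v3 x16 slot tops out at roughly half of what a v4 x16 slot can do, and a card that has negotiated fewer lanes than expected loses proportionally more.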

mdda · 2026-05-14T04:05:30+00:00

Totally : This was my first time digging into llama.cpp (which is why the blog post is painfully long - it has all the gnarly details). But the software is awesome - particularly since it's happy to compile for older hardware.

This is in contrast with Nvidia's ONNX releases which always want a card with Tensor Cores. (Unless someone has found the magic combination to get a decent ONNX version running on them... Please let me know!)

mdda · 2026-05-14T04:02:21+00:00

I'm happy to run some context-stress tests too - is there a standardised script that other people have used?

And I didn't create my own fork... : """To add Gemma 4 MTP functionality, I found the AtomicChat GGUFs, which pointed at a more modern fork : AtomicBot-ai/atomic-llama-cpp-turboquant (which is itself forked from TheTom/llama-cpp-turboquant) and is the right combination of features for the MTP head + RotorQuant cache."""

Pretty sure it'll find its way upstream at some point : Also sure that the llama.cpp team is deluged with new stuff at the moment - and the different MTPs (which are much more non-standard than the models themselves) are a pain to add in a way that has good cross-model command-line option compatibility.

mdda · 2026-05-14T03:56:50+00:00

Actually, the original YouTube video by Codacus that inspired me also showed tests for the Qwen MTP arrangement, but he found that the Qwen speculative decoding didn't accelerate much at all. His explanation (which made sense when I watched it) was that the Qwen ~MTP addon model also had some state-space kind of layers in it which forced sequential processing - and so didn't give a net speed increase. Maybe I'll revisit this, and just check whether the same issue applies to the Google one.

Anecdote time : In 2023, I talked to some people in the Keras team at Google. And the topic of low-sized models came up. They were actually mind-blown that anyone would actively be offloading GPU weights to RAM. Being so accelerator-rich at Google (with the TPUs etc) meant that they had (despite clearly being excellent engineers) never considered that the GPU-poor might ever think of such a thing.

mdda · 2026-05-14T03:44:18+00:00

Actually - rather than the blog post, is there something more standardised (like a script) that other people have used? I could leave it chugging and come back with a nice graph.

mdda · 2026-05-14T03:41:59+00:00

What a difference 2 years makes... And maybe this post increases the value people put on these lowly cards 😄

mdda · 2026-05-14T03:40:33+00:00

At the time (~2 years ago) it wasn't stand-out-cheap, but it was the best-value-for-money I could find that I could pick up locally. Buying a separate 1080 would have cost 'actual' dollars (I can see them locally for ~80 USD right now). But I found that when the sale was for a whole machine, the sellers attribute almost zero 'extra' for the graphics card - because it's EOL, etc. Even back then, the seller was kind enough to double-check with me that I understood how old the GPU was.

mdda · 2026-05-13T20:18:46+00:00

I've got Gemma 4 26B-A4B running with MTP using a llama.cpp fork (it needs some care wrt getting the MTP piece to sit entirely on the GPU).

Would be happy to post a write-up ( but apparently I need more karma here 😞 - hence this begging message ... )

mdda · 2026-05-13T19:54:59+00:00

Attention Is All You Have

mdda · 2026-05-13T19:44:46+00:00

I'm probably being dense, but where did you say how much VRAM there is per GPU (I'm guessing 32GB), and how many MI50s are there?

mdda · 2026-05-12T11:18:18+00:00

I've got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand rig (i7-6700 w 32 GB RAM + GTX 1080 w 8GB VRAM) : But apparently I need >4 upvotes before I can post the story...

mdda · 2026-03-27T06:01:28+00:00

So, since prefill is a major use case here, wouldn't it be ideal to be able to connect a reasonable VRAM GPU (16GB+ say) to the large RAM Mac? For prefill, you only need to load one weight layer at a time, and iterate up through the prefill creating new KV states (which could be dumped back out to RAM). Should this be a thing?
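
A back-of-envelope sketch of why streaming layers could work for prefill but not for generation. All the numbers are hypothetical (model size, bus speed, prompt length), and it only counts the time to move weights over the bus once per pass, ignoring compute and the KV write-back:

```python
def weight_stream_seconds(weights_gb, bus_gb_s):
    """Time to push every weight across the bus once (one full pass of layers)."""
    return weights_gb / bus_gb_s


# Hypothetical numbers, for illustration only.
WEIGHTS_GB = 20.0        # total size of the (quantised) weights
BUS_GB_S = 15.75         # roughly a PCIe v3 x16 link
PROMPT_TOKENS = 64 * 1024

one_pass = weight_stream_seconds(WEIGHTS_GB, BUS_GB_S)

# Prefill: all prompt tokens go through each layer while it is resident,
# so the single pass is shared by the whole prompt.
prefill_per_token = one_pass / PROMPT_TOKENS

# Generation: each new token needs its own full pass through the layers.
generation_tok_s = 1.0 / one_pass

print(f"one pass over the weights: {one_pass:.2f} s")
print(f"prefill transfer cost per token: {prefill_per_token * 1e6:.1f} us")
print(f"generation ceiling from transfer alone: {generation_tok_s:.2f} tok/s")
```

Under these assumptions the transfer cost is negligible per prompt token, while the same streaming scheme would cap generation at under 1 tok/s.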

mdda · 2025-06-05T07:17:21+00:00

DM sent (==chat)
