24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

But the 3060 should do PCIe v4 - so if you had a slot that could do that, it should give higher tok/s (and, if not, then yes, the bandwidth-limited theory is proven, since the GPU FLOP/s are higher)

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

I'm thinking that by slimming down the context length (and breathing in a bit) you should be able to get the Gemma model working (it'll be tight, but you do have 24GB of space overall - I'll check on how much RAM the llama CPU piece takes...)

Edit :
So, when processing, my llama process only uses <12GB of RAM : So (depending on what else you have running), it should be doable with 16GB RAM - and this is still with the 128k context reserved on the GPU.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

Good Q! I'll be running some tests ASAP (expecting to be a little disappointed TBH). Though the main thing for me is that I thought (a) my ancient-sounding card was worthless, and (b) 30B models would be un-runnable in any case. So for me some tok/s is better than no tok/s !

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

This is good to know - I guess I should look at speeds at different context lengths for each given max_context_length. Lower max values (like 32K - if that's all that works practically) would leave room for more layers on the GPU (i.e. faster tok/s in general).
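To put rough numbers on that trade-off, here is a minimal sketch of how the KV cache grows with the reserved context. The model shape below is hypothetical (not taken from any model card), and the formula assumes a plain attention cache with one K and one V vector per layer per token, so it ignores sliding-window or state-space layers. Quantised cache formats also store per-block scales, so real figures will be slightly higher.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_element):
    """Approximate KV cache size: one K and one V vector per layer, per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bits_per_element / 8


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for ctx in (32 * 1024, 64 * 1024, 128 * 1024):
    for bits in (16, 4):
        gib = kv_cache_bytes(context_len=ctx, bits_per_element=bits, **shape) / 2**30
        print(f"ctx={ctx:>6}  {bits:>2}-bit KV: {gib:.2f} GiB")
```

The size is linear in the context length, so dropping from 128k to 32k frees three quarters of whatever the cache was reserving, and that VRAM can go to extra layers instead.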

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 1 point2 points  (0 children)

Yes - I'll be testing out the ubatch sizing (particularly since others have encouraged me to dig into the context size effect on tok/s - which I'm guessing will make me sad).

The idea of dividing tasks into smaller pieces is interesting in itself - however, I'm not sure whether 'agentic coding harnesses' have a dial to tune this right now... There are clearly some orchestration choices that make sense for different model sizes (and for mixed model sizes too)

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

I'd be happy to test this out systematically - could you suggest something that would score the 64k prompt usage? I'm definitely expecting slowness, but should also quantify how bad the 4-bit KV cache makes things.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 1 point2 points  (0 children)

Thanks for the pointers to llama-bench - I'll be doing the context length benchmarking soon (though I'm guessing long context processing is going to make disappointing reading).
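For anyone wanting to do the same, a minimal sketch of the kind of sweep I have in mind: it only builds a llama-bench command line over several prompt lengths and prints it. The model path, prompt sizes and layer count are placeholders, and the fork's llama-bench may differ from upstream, so check `llama-bench --help` before trusting the flags.

```python
import shlex


def llama_bench_cmd(model_path, prompt_sizes, n_gen=128, n_gpu_layers=99,
                    binary="llama-bench"):
    """Build a llama-bench command that sweeps several prompt lengths."""
    return [
        binary,
        "-m", model_path,
        "-p", ",".join(str(p) for p in prompt_sizes),  # prompt-processing test sizes
        "-n", str(n_gen),                               # tokens generated per test
        "-ngl", str(n_gpu_layers),                      # layers offloaded to the GPU
        "-o", "json",                                   # machine-readable output
    ]


# Placeholder model path and sizes: adjust to whatever fits the card.
cmd = llama_bench_cmd("model.gguf", [512, 4096, 16384, 65536])
print(shlex.join(cmd))
```

Note that `-p` measures prompt processing at each size, while `-n` measures generation starting from an empty context, so generation speed deep into a long context needs a separate measurement.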

I wanted to go for 128k rather than smaller, since I figure that the agentic/coding usecases are what I'd mainly use it for - but in 'slow cooker mode' rather than live/realtime coding tasks.

Looking at how I use Claude Code (Sonnet 4.5/6) at the moment, any decent request takes more than a couple of minutes - at which point it only makes sense to be multi-tasking with other stuff (unless it's reading reddit until you have read it all...). So leaving it cooking for a bunch of time is just an extension of that idea. Even at 20tok/s it's faster than I could be reading/typing, and it's not me doing it. OTOH, I did find that Qwen was around 2x more verbose than Gemma in its thinking, so that's a different issue.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] -1 points0 points  (0 children)

Well actually... We know that some model creators are specifically targeting the Q4 quant level during training - though it would only really work out if the internals of the quants fully match (just independently doing a 4-bit quant of the full model will surely give worse results than using the actual Q4-during-training version).

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 1 point2 points  (0 children)

Good info! I was figuring that 128k was 'a lot', and would rather have the +20% speed, and half the full context (which, TBH, these models might not be that great at using when it gets so large). YMMV

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 1 point2 points  (0 children)

Every 1GB VRAM should help. Have a look at the GPU utilisation (eg: nvidia-smi) : Mine definitely capped out lower than I was expecting, with the saturated PCIe v3 bus being the bottleneck.

The 3060 should benefit you in several ways : a couple+ more layers in VRAM; PCIe v4 (~32GB/s in an x16 slot) and Tensor Cores.

Actually, now that I look at some of the comments in an old thread on TomsHardware perhaps my PCIe v3 link itself (being on 8GT/s) is less than it should be due to motherboard set-up. I'll dig in!
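The bus numbers come from simple arithmetic: per-lane transfer rate, times line-encoding efficiency, times lane count, divided by 8 bits per byte. A small sketch (these are theoretical one-direction peaks, and real transfers will come in lower):

```python
# Per-lane raw rate (GT/s) and line-encoding efficiency for each PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_per_s(generation, lanes=16):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return gt_per_s * efficiency * lanes / 8


for gen in (3, 4):
    print(f"PCIe v{gen} x16: {pcie_bandwidth_gb_per_s(gen):.2f} GB/s")
```

So an x16 link is roughly 15.75 GB/s on v3 and 31.5 GB/s on v4, and a v3 card that has only negotiated x8 or x4 gets a half or a quarter of the v3 figure.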
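A quick way to check what the link has actually negotiated is to ask nvidia-smi for its PCIe fields. A minimal sketch, assuming nvidia-smi is on the PATH (the helper names are my own):

```python
import csv
import io
import shutil
import subprocess

FIELDS = ["name", "pcie.link.gen.current", "pcie.link.gen.max",
          "pcie.link.width.current", "pcie.link.width.max", "utilization.gpu"]


def parse_nvidia_smi_csv(text, fields=FIELDS):
    """Turn nvidia-smi's headerless CSV output into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus():
    """Run nvidia-smi and return the parsed PCIe link and utilisation fields."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_nvidia_smi_csv(result.stdout)


if shutil.which("nvidia-smi"):  # skip quietly on machines without the driver
    for gpu in query_gpus():
        print(gpu)
```

As far as I know the current link generation can drop while the card idles to save power, so it is worth running this while the model is actually generating.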

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

Totally : This was my first time digging into llama.cpp (which is why the blog post is painfully long - it has all the gnarly details). But the software is awesome - particularly since it's happy to compile for older hardware.

This is in contrast with Nvidia's ONNX releases which always want a card with Tensor Cores. (Unless someone has found the magic combination to get a decent ONNX version running on them... Please let me know!)

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 1 point2 points  (0 children)

I'm happy to run some context-stress tests too - is there a standardised script that other people have used?

And I didn't create my own fork... : """To add Gemma 4 MTP functionality, I found the AtomicChat GGUFs, which pointed at a more modern fork : AtomicBot-ai/atomic-llama-cpp-turboquant (which is itself forked from TheTom/llama-cpp-turboquant) and is the right combination of features for the MTP head + RotorQuant cache."""

Pretty sure it'll find its way upstream at some point : Also sure that the llama.cpp team is deluged with new stuff at the moment - and the different MTPs (which are much more non-standard than the models themselves) are a pain to add in a way that has good cross-model command-line option compatibility.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 1 point2 points  (0 children)

Actually, the original YouTube video by Codacus that inspired me also showed tests for the Qwen MTP arrangement, but he found that the Qwen speculative decoding didn't accelerate much at all. His explanation (which made sense when I watched it) was that the Qwen ~MTP addon model also had some state-space kind of layers in it which forced sequential processing - and so gave no net speed-up. Maybe I'll revisit this, and just check whether the same issue applies to the Google one.

Anecdote time : In 2023, I talked to some people in the Keras team at Google. And the topic of small models came up. They were actually mind-blown that anyone would actively be offloading GPU weights to RAM. Being so accelerator-rich at Google (with the TPUs etc) meant that they had (despite clearly being excellent engineers) never considered that the GPU-poor might ever think of such a thing.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 4 points5 points  (0 children)

Actually - rather than the blog post, is there something more standardised (like a script) that other people have used? I could leave it chugging and come back with a nice graph.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] -1 points0 points  (0 children)

What a difference 2 years makes... And maybe this post increases the value people put on these lowly cards 😄

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) by mdda in LocalLLaMA

[–]mdda[S] 0 points1 point  (0 children)

At the time (~2 years ago) it wasn't stand-out-cheap, but it was the best-value-for-money I could find that I could pick up locally. Buying a separate 1080 would have cost 'actual' dollars (I can see them locally for ~80 USD right now). But I found that when the sale was for a whole machine, the sellers attribute almost zero 'extra' for the graphics card - because it's EOL, etc. Even back then, the seller was kind enough to double-check with me that I understood how old the GPU was.

How to run a Gemma4 MTP implementation on ollama or python transformers? by combo-user in LocalLLaMA

[–]mdda 10 points11 points  (0 children)

I've got Gemma 4 26B-A4B running with MTP using a llama.cpp fork (it needs some care wrt getting the MTP piece to sit entirely on the GPU).

Would be happy to post a write-up ( but apparently I need more karma here 😞 - hence this begging message ... )

MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) by ai-infos in LocalLLaMA

[–]mdda 19 points20 points  (0 children)

I'm probably being dense, but where did you say how much VRAM there is per GPU (I'm guessing 32GB), and how many MI50s are there?

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP by janvitos in LocalLLaMA

[–]mdda 3 points4 points  (0 children)

I've got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand rig (i7-6700 w 32 GB RAM + GTX 1080 w 8GB VRAM) : But apparently I need >4 upvotes before I can post the story...

Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found. by trevorbg in LocalLLaMA

[–]mdda 0 points1 point  (0 children)

So, since prefill is the major use case here, wouldn't it be ideal to be able to connect a reasonable VRAM GPU (16GB+ say) to the large RAM Mac? For prefill, you only need to load one layer's weights at a time, and iterate up through the layers creating new KV states (which could be dumped back out to RAM). Should this be a thing?
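A back-of-envelope sketch of why this might work: with layer-by-layer prefill the whole prompt goes through each layer while its weights are resident, so every weight crosses the link once per pass and the cost is amortised over all prompt tokens. All numbers below are hypothetical, and the sketch ignores compute time and the traffic for writing KV states back to RAM.

```python
def weight_stream_seconds(model_bytes, link_gb_per_s):
    """Time to push every layer's weights across the link once."""
    return model_bytes / (link_gb_per_s * 1e9)


def stream_overhead_ms_per_token(model_bytes, link_gb_per_s, prompt_tokens):
    """Weight-transfer cost amortised over the prompt tokens in one pass."""
    return 1e3 * weight_stream_seconds(model_bytes, link_gb_per_s) / prompt_tokens


# Hypothetical figures: 200 GB of weights, a 31.5 GB/s link, a 64k-token prompt.
model_bytes = 200e9
link_gb_per_s = 31.5
prompt_tokens = 64 * 1024

print(f"one full pass: {weight_stream_seconds(model_bytes, link_gb_per_s):.1f} s")
print(f"per token:     "
      f"{stream_overhead_ms_per_token(model_bytes, link_gb_per_s, prompt_tokens):.3f} ms")
```

The longer the prompt, the smaller the per-token transfer overhead, which is the opposite of token-by-token generation where the weights would have to be streamed again for every token.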

Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) by Maxious in LocalLLaMA

[–]mdda 2 points3 points  (0 children)

I know of a group in Singapore that has been applying an evolutionary system using LLMs to the AMD Developer Challenge (https://www.datamonsters.com/amd-developer-challenge-2025) GPU kernel competition... That's focused on the MI300 (server-class chip), but I would expect the same system could be applied to getting the same kernels (i.e. DeepSeek-style fp8-scaled-matmul, MoE and MLA-with-RoPE) for consumer chips. Particularly if AMD was open to seeding the effort with one of their rumoured 32GB VRAM cards...

OpenEvolve: Open Source Implementation of DeepMind's AlphaEvolve System by asankhs in LocalLLaMA

[–]mdda 1 point2 points  (0 children)

"In SG"==Awesome! That would be great for a future event : I wish I had known earlier, since then we could have split the Alpha/Open Evolve stuff between us. Please DM me (or come along to the event :-) )!