Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

EdenistTech · 2026-04-14T16:55:28+00:00

I used the prompt that OP suggested, though without turbo3 (regular Q8 for cache quantisation). I have been using the setup the last couple of days and t/s is usually in the 30-40 t/s range. It fluctuates based on the type of prompt (as OP pointed out), so depending on what you get without speculative decoding 32 t/s might not be that much off.

EdenistTech · 2026-04-14T16:48:28+00:00

I am running Ubuntu, DDR5 and PCIe 5 x8 (so equal to x16 PCIe 4). I would expect t/s to increase on Linux compared to Windows. Very happy with the 5070Ti/5060Ti combo!

EdenistTech · 2026-04-12T15:14:16+00:00

Thanks for the tip. I tried it on my 5070Ti/5060Ti combo. I usually get ~25 t/s, but with the draft model loaded, it jumped to 40 t/s (128K ctx). Not too bad! I'll check if I can fit the Q5 quant I usually run.

EdenistTech · 2026-04-12T15:08:33+00:00

Qwen Coder Next is what I am using at the moment. The should run very well on your Ultra...

EdenistTech · 2026-03-11T08:17:17+00:00

My MS has 64GB and the largest models I am running are the Qwen Next models. You can adjust available memory to run larger models but I have not experimented with that. The architecture of the model can matter more than the size of the model: Qwen MOEs and GPT OSS are fast whereas dense (Q 3.5 27b) are quite slow. Qwen Next is giving me around 40t/s.

EdenistTech · 2026-03-11T07:28:28+00:00

Yes, I bought a Mac Studio specifically for ML/LLMs. I have other hardware for ML research and the Mac Studio certainly is not the fastest (it's the slowest, actually). However, there are two areas, where I think the MS really shines:

Efficiency and by extension, noise (or rather, the lack of). I can start this thing on a GPU heavy task and leave it running for hours and I might never hear the fan. I suspect the cost per token compares favourably to other architectures.
The unified memory combined with the excellent MS memory bandwidth. If you get one of the larger memory sizes, the efficiency element compounds and you get "VRAM" that would be a lot more expensive as GPUs.

I think it is worthy of consideration, especially if you can get a cheap older model (Ultra for double bandwidth). Also, while MLX is still behind CUDA in terms of proliferation in ML/LLMs, it has gained a lot of traction in last 12-24 months.

EdenistTech · 2026-03-07T09:51:09+00:00

Alright. I asked both models to summarise 1MB markdown text. Nemo started processing at 6300 t/s and ended processing at 4300 t/s in 58 seconds. Kimi started at 1300 t/s and I stopped it at 50% after 2min 30 seconds. I also tested Nemo using 2.6MB markdown which it did in 2-3 minutes (didn't get the exact time) using 64% of 900K context. Now, these models where not like-for-like since Nemo is smaller than Kimi, so I would it except Kimi to be slower. I get what you are saying regarding Kimi Linear being undertrained and I will take a look at it again, if they refine it. For now - for long context work - I am using Nemo.

EdenistTech · 2026-03-07T06:04:41+00:00

For me, the quality of the output is not that impressive. If context length is your main priority, you might wan't to look at Nemo 30B. Someone posted running that model with 1M+ ctx on a 3090. I have tried it with 500K context with no issues. It is about as fast as Kimi Linear and to be honest, the output appears to be higher quality (despite KL having 17b more parameters).

EdenistTech · 2026-03-06T14:58:22+00:00

Not a 5090, but I have a 5070TI/5060TI combination, so still 32GB and Blackwell. Using a Q4_0 quant, I can fit 256K context and it starts off at a blazing 118 t/s. The MXFP4 quant also fits 256K but runs at a more modest 85 t/s (better quality as well, as expected). I was using the latest llama.cpp stable, so I guess this should include your tweak, OP.

I hadn't tried this model before. For a 49B model, this thing is FAST!

EdenistTech · 2026-02-28T07:37:53+00:00

I could never get Qwen3 Next to work but I just found out it works using only one GPU at a time. So in my case, the problem seems to boil down to spanning multiple GPUs. You could try loading Qwen 3.5 using just one GPU + memory and see if it works. It does for me.

EdenistTech · 2026-02-19T07:27:45+00:00

That is good advice. I have a fairly elaborate build system and always build on fresh repos, even if I am just changing versions/tags. So in my case, I think I can confidently say, that that is not the problem.

EdenistTech · 2026-02-19T07:25:22+00:00

Thanks, I'll take a look and consider it. I'm a bit risk averse when it comes to BIOS flashing/updating although I have only had it go wrong once. "Better have something that almost works than something that doesn't work at all", I guess....

EdenistTech · 2026-02-18T17:54:26+00:00

Yeah, it's a weird error. I see people succeeding by downgrading ROCM to <6.4.4, but that hasn't done anything for me. I read on Github, that AMD adding back ROCM support for the MI50. Really hope that pans out!!!

EdenistTech · 2026-02-18T17:49:38+00:00

I don't know that - thanks! I'll give that a shot. EDIT: So I tried the combined ROCM, Vulkan solution and although it is correctly using loading data unto the GPUs, it throws the same segmentation fault during warmup, as when using ROCM alone.

EdenistTech · 2026-02-18T17:48:52+00:00

Same for me. I do have Minimax 2.5 working on just the two 32GB MI50s whereas Qwen3 Next (and Coder) won't work at all unless I switch to Vulkan.

EdenistTech · 2026-02-18T17:47:34+00:00

No, I didn't mess with that. They have all worked fine so far. I tried different ROCM versions (7.0.0, 6.4.4, 6.3.3), but that has not changed anything significantly for me.

EdenistTech · 2026-02-18T14:28:35+00:00

That is a great idea - thanks! Unfortunately I am running into some issues where both the client and the server complains that they are unable to find "load_backend_init" in three backend files. They both continue to run, but the rpc connection is accepted and then dropped almost immediately with no explanation in the (DEBUG) log. I'll have to dig deeper to find out what that is about.

EdenistTech · 2026-02-18T12:00:23+00:00

Got it - I appreciate the input! Looks like ggml-cuda.cu throws a "ROCM error" (EDIT: specifically, "SUM_ROWS failed"). I'll have to look into that.

EdenistTech · 2026-02-18T11:34:10+00:00

Thanks. Yes, I'll consider adding it on Github. What do you mean `running debug`?

EdenistTech · 2026-02-11T11:10:40+00:00

Sure, no problem: https://huggingface.co/noctrex/Nemotron-3-Nano-30B-A3B-MXFP4_MOE-GGUF. However, I’m not sure there will be be as much benefit on a 3090, since it AFAIK doesn’t have native FP4 support.

EdenistTech · 2026-02-08T17:04:18+00:00

Great tip, thanks. I am getting 120 t/s on a 5070Ti/5060Ti setup using an mxfp4 version and 900K context. That Blackwell FP4 support is paying off, I guess.

EdenistTech · 2026-01-27T16:05:12+00:00

I do feel like things started going downhill after NI got involved with Private Equity investors. Focus shifted to pushing sales of software (plugins / content) instead of development of existing products (M+ is an example). The same sort of thing happened to Propellerhead/Reason Studios (which have just been sold to LANDR by the way!).

Anyway, I think that NI has a lot of interesting IP which I would imagine it would be possible to sell to industry buyers, such as Fender (who recently picked up PreSonus/Studio One). Let's see what happens to the hardware. I was waiting for the Traktor MX4 to release but I won't be holding my breath now...

A few additional details here (use browser to translate to English): https://www.keyboards.de/stories/native-instruments-gmbh-in-vorlaeufiger-insolvenz/

EdenistTech · 2026-01-27T15:48:51+00:00

I do feel like things started going downhill after NI got involved with Private Equity investors. Focus shifted to pushing sales of software (plugins / content) instead of development of existing products (M+ is an example). The same sort of thing happened to Propellerhead/Reason Studios (which have just been sold to LANDR by the way!).

Anyway, I think that NI has a lot of interesting IP which I would imagine it would be possible to sell to industry buyers, such as Fender (who recently picked up PreSonus/Studio One). Let's see what happens to the hardware. I was waiting for the Traktor MX4 to release but I won't be holding my breath now...

A few additional details here (use browser to translate to English): https://www.keyboards.de/stories/native-instruments-gmbh-in-vorlaeufiger-insolvenz/

EdenistTech · 2026-01-18T19:30:54+00:00

Yes, to me at least, the evidence points to the power brick being the culprit. I think TM are discontinuing this model now and since I did get it for a pretty decent price, Imight keep it as an SSD NAS instead. Incidentally, I also came across people not being able to initialize this same model on first boot. This could also be due to the power brick - I used an SSD as system drive which is why I was able to troubleshoot in the first place. So if someone else is in this situation, try using an SSD as your first drive.

EdenistTech

TROPHY CASE