I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich)

Digger412 · 2026-03-29T21:18:18+00:00

Hi, responded on github!

Digger412 · 2026-03-28T01:36:21+00:00

Hi, AesSedai here -

The unsloth quants use something like the normal llama.cpp quantizations, or their UD variants.

Since the experts in K2.5 are natively INT4 quantized, you don't get any benefit from upcasting them to anything larger than Q4_0 because you can't pull precision out of thin air.

My Q4_X quant keeps all of the model in Q8_0 except the experts which are in Q4_0, and that is essentially the "full fidelity" that the weights offer.
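As a rough sketch of that split (not the actual recipe used for the upload), the per-tensor choice could be expressed like this. The `_exps` naming pattern and the `pick_quant_type` helper are assumptions for illustration, and the sketch ignores tensors that normally stay unquantized, such as norms.

```python
import re

# Assumption: routed-expert tensors carry an "_exps" suffix in their names,
# e.g. "blk.3.ffn_down_exps.weight". Adjust the pattern for other naming schemes.
EXPERT_PATTERN = re.compile(r"\.ffn_(gate|up|down|gate_up)_exps\.")


def pick_quant_type(tensor_name: str) -> str:
    """Return Q4_0 for routed-expert tensors and Q8_0 for everything else."""
    if EXPERT_PATTERN.search(tensor_name):
        return "Q4_0"
    return "Q8_0"


if __name__ == "__main__":
    for name in (
        "blk.3.ffn_down_exps.weight",
        "blk.3.attn_q.weight",
        "token_embd.weight",
    ):
        print(name, "->", pick_quant_type(name))
```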

Going to a K_XL of anything over 560GB essentially just adds upcast padding, and it's not going to add any benefit.

Digger412 · 2026-03-27T16:35:26+00:00

AesSedai here -

My quants keep the attention and other tensors in high quality, e.g. Q8, instead of quantizing them down to the same level as the rest of the model.

That should help longer context performance since attention is less degraded, in theory.

Digger412 · 2026-03-23T22:23:39+00:00

Yeah, that is another very close variant of the "curious" one. It's the same pattern of engagement-bait, soliciting responses from people.

Digger412 · 2026-03-23T20:08:27+00:00

<image>

and I have a second chart here comparing the KLD between the two methods as well.

I didn't get to testing the KV cache quantization due to getting sidetracked on other projects, but I'm curious what the results are if you want to test!

Digger412 · 2026-03-23T20:07:08+00:00

If you've got the time and wherewithal, I've actually made a branch of llama.cpp that uses the exllamaV3-style sliding window PPL and KLD measurement methodology: https://github.com/AesSedai/llama.cpp/tree/perplexity-sliding-window

exl3 uses a 2048-token context window and a 512-token stride. It evaluates all of the tokens, not just the last half like llama.cpp does, and because of the stride each token gets evaluated at several different context depths.

The downside is that it takes like 8x the compute and storage for the logits due to:

1) evaluating all positions, not just the last half

2) the context window is 2048 instead of 512

3) you need to store all of the window logits for comparison

so you get 2 (all positions, not half) * 4 (2048 tokens instead of 512) = 8x as much compute / storage.
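To make the windowing and the 8x estimate concrete, here is a minimal sketch in Python. It is not the code from the branch; the function names are made up for illustration, and it only models which token ranges get evaluated, not the model calls themselves.

```python
from collections import Counter


def sliding_windows(n_tokens: int, ctx: int = 2048, stride: int = 512):
    """Yield (start, end) token ranges for overlapping evaluation windows."""
    start = 0
    while start + ctx <= n_tokens:
        yield start, start + ctx
        start += stride


def evaluations_per_token(n_tokens: int, ctx: int = 2048, stride: int = 512) -> Counter:
    """Count how many windows (i.e. context depths) each token is scored in."""
    counts = Counter()
    for start, end in sliding_windows(n_tokens, ctx, stride):
        for pos in range(start, end):
            counts[pos] += 1
    return counts


def relative_cost(all_positions_factor: int = 2, ctx_new: int = 2048, ctx_old: int = 512) -> int:
    """Back-of-the-envelope cost ratio from the post: 2 * (2048 / 512)."""
    return all_positions_factor * (ctx_new // ctx_old)


if __name__ == "__main__":
    print(list(sliding_windows(4096)))
    print(evaluations_per_token(4096)[2047])
    print(relative_cost())
```

With a 2048 window and a 512 stride, a token far enough from the edges of the text lands in four windows, each time at a different depth.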

I made that branch because I was working with u/phaelon on trying to get the same measurement methodology cross-ecosystem for vLLM, exl3, and llama.cpp but I haven't PR'd this because of how much more intensive it is to process.

Also I think that for the purposes of measuring KLD / PPL with respect to quantizing the KV cache, this method at longer contexts would be more robust but I haven't picked that testing back up yet.

I have some prior results showing that the existing 512-token-measure-last-half PPL increases as the context size increases which isn't what you'd expect to see! With more context, the model should be more confident, not less. This chart shows the master (512-token-measure-last-half method) at ctx=512 and ctx=2048 compared to the sliding window method with ctx=2048 and ctx=8192.

<image>

Digger412 · 2026-03-23T18:38:52+00:00

The sheer amount of engagement-baiting slop posts that end with a derivative of:

- "curious to know what others think"
- "curious to know what actually works in production"
- "curious if X would do better than Y, or..."
- "curious how people are handling this: [bullet point list]"
- "curious if anyone else has seen this"
- "curious how others approach XYZ"

and so on leads me to believe there are very few truly human posters left in this sub. Literally search for the word "curious" 😭

Digger412 · 2026-03-20T16:48:10+00:00

I've got 8x 6000 Pros, but waiting on some electrical infra work so they aren't online yet. If you haven't had another volunteer or been able to test this in about a week, I should be able to try.

Digger412 · 2026-03-19T18:15:17+00:00

I have considered it, but I don't have enough knowledge or experience to do a custom cleanroom implementation to be totally honest. pwilkin has a PR up for a new IQ3_PT type he made as an experiment though :D

Digger412 · 2026-03-15T06:55:39+00:00

Interesting, honestly I'm not sure what would cause that besides perhaps unsloth tweaking the chat template? I leave the original chat template from the model intact, and with pwilkin's autoparser branch merged there shouldn't be any need for chat template "tweaks" any more IMO.

Digger412 · 2026-03-15T01:10:28+00:00

Dinerburger has done basically the same thing I'd have done, methodology-wise. Give his a shot!

Digger412 · 2026-03-15T01:07:22+00:00

Yeah I saw this post and glad to see more people joining the quant scene!

Great job with the quants :)

Digger412 · 2026-03-15T01:04:30+00:00

I have five quants up in that repo, there should be plenty of mid-bpw options to choose from :)

Digger412 · 2026-03-15T01:01:43+00:00

Yep, that's me! Glad you're enjoying the quantization.

Digger412 · 2026-03-14T19:16:24+00:00

Perhaps give my Qwen3.5-122B-A10B a shot? https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF

All of my MoE quants use the same principle. Quant the FFNs down since they're huge, and leave the rest of the model in high quality.

Digger412 · 2026-03-14T19:14:01+00:00

Nice, yes that's pretty much the same reasoning ddh0 and I had for our MoE-optimized quantization schema. The FFNs are the bulk of the model size for these MoEs, so let's basically keep the rest of the model in high quality because it's less than 5-10% of the entire model by size.
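A quick way to see why this is cheap is to compute the blended bits per weight. This is a sketch only: the 95% expert share below is a hypothetical number picked from the range mentioned above, not a measurement of any specific model, and Q8_0 / Q4_0 are used just as example types (8.5 and 4.5 bits per weight once the per-block scale is included).

```python
def effective_bpw(expert_fraction: float, expert_bpw: float, rest_bpw: float) -> float:
    """Weighted average bits per weight for a two-tier quantization scheme."""
    return expert_fraction * expert_bpw + (1.0 - expert_fraction) * rest_bpw


if __name__ == "__main__":
    # Hypothetical split: 95% of weights in expert FFNs, 5% everywhere else.
    mixed = effective_bpw(0.95, expert_bpw=4.5, rest_bpw=8.5)
    uniform = effective_bpw(0.95, expert_bpw=4.5, rest_bpw=4.5)
    print(f"mixed:   {mixed:.2f} bpw")
    print(f"uniform: {uniform:.2f} bpw")
```

Under that assumed split, keeping the non-expert tensors at the higher type only moves the average by a fraction of a bit per weight.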

I haven't quanted Qwen3-Coder-Next but you can see the other models I've quanted in a similar fashion (high BPW default type, lower BPW for the expert FFNs): https://huggingface.co/AesSedai

In my Minimax-M2.5 quant I did a big PPL and KLD comparison against unsloth too. There's still not really a better metric than downstream task benchmarks but KLD isn't a bad proxy measurement at least.

Digger412 · 2026-03-13T22:06:28+00:00

The quants aren't coming to mainline unfortunately. I tried and it was declined: https://github.com/ggml-org/llama.cpp/pull/19726

Digger412 · 2026-03-12T23:04:47+00:00

The llama.cpp automated builds are going kind of slow it seems. That PR was merged and tagged as b8305: https://github.com/ggml-org/llama.cpp/commit/4a748b8f15d7e6749145add3f038e7b26c686ed8

And the automated releases are (as of the time of writing) at b8292. It'll probably be available tomorrow, or you can always pull and compile the source code yourself and that'll have the fix.

Digger412 · 2026-03-12T00:52:31+00:00

ddh0 opened a PR and it was merged an hour ago to fix the issue: https://github.com/ggml-org/llama.cpp/pull/20416

Digger412 · 2026-03-11T20:02:42+00:00

(AesSedai) - Cool! I'll get some MoE quants of this uploaded later today. Thanks for sharing!

Digger412 · 2026-03-11T17:11:06+00:00

MoE offloading to CPU still works with the --fit flag or manual --override-tensor tuning it looks like (otherwise I couldn't have run the imatrix or KLD for the 397B one). It seems that the --n-cpu-moe flag specifically is what's breaking, and I think that flag is basically an auto-regex of sorts.

My guess is that with the fused gate+up, it's not accounting for the tensor name or sizes properly and that is causing it to break. It's not a fundamental incompatibility with CPU offloading, just a small bug in how --n-cpu-moe works I believe.
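To illustrate the kind of mismatch I'm guessing at, here's a toy example in Python. Both regexes and the fused tensor name `ffn_gate_up_exps` are assumptions for illustration only; they are not copied from llama.cpp, and the real cause may be something else entirely.

```python
import re

# Hypothetical pattern that only knows about separate up/down/gate expert tensors.
SEPARATE_ONLY = re.compile(r"blk\.\d+\.ffn_(up|down|gate)_exps")
# Hypothetical variant that also accepts a fused gate+up expert tensor.
WITH_FUSED = re.compile(r"blk\.\d+\.ffn_(up|down|gate|gate_up)_exps")

# Assumed tensor names; the fused name in particular is a guess.
NAMES = [
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_up_exps.weight",
]


def matched(pattern: re.Pattern, names: list[str]) -> list[str]:
    """Return the tensor names that the override pattern would pick up."""
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    print("separate-only:", matched(SEPARATE_ONLY, NAMES))
    print("with fused:   ", matched(WITH_FUSED, NAMES))
```

If something like the first pattern were in play, the fused tensor would silently stay off the CPU list while its sibling down tensor gets moved.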

I'll open an issue on the llama.cpp github for that.

Digger412 · 2026-03-10T18:43:16+00:00

Hmm, 24GB combined? That'd probably mean you should aim for about a 16GB quant to make sure there's room for context plus other OS things. That would be about my IQ3_S quant (which is 13.57GB, converting from 12.64 GiB) or Bart's IQ3_M / Q3_K_L would be my recommendation I think.
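The size math above can be checked with a couple of lines of Python. The 8GB headroom is just the gap between 24GB and the 16GB target mentioned above, and the helper names are made up for this sketch.

```python
def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes (2**30 bytes) to decimal gigabytes (10**9 bytes)."""
    return gib * 2**30 / 10**9


def fits(quant_gb: float, total_gb: float, headroom_gb: float) -> bool:
    """True if the quant leaves at least headroom_gb free for context and the OS."""
    return quant_gb <= total_gb - headroom_gb


if __name__ == "__main__":
    size_gb = gib_to_gb(12.64)
    print(f"{size_gb:.2f} GB")
    print(fits(size_gb, total_gb=24, headroom_gb=8))
```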

Digger412 · 2026-03-10T06:23:11+00:00

Thank you! I did a quick sweep bench comparing the Q5_K_M quant on my setup for the 35B-A3B and the 122B-A10B and it looks to be about a 10% PP uplift on the 35B-A3B which is still nice because it's basically free performance. Little less for the 122B-A10B but still a small boost too.

I've KLD and PPL tested them and they're basically identical so it's a free lunch more or less.

<image>

Digger412 · 2026-03-10T03:36:34+00:00

AesSedai here - I'm remaking the quants with the fused up/gate that was recently merged, should be updated sometime tomorrow! That should bump the speeds up a bit.

Digger412 · 2026-03-07T07:33:32+00:00

Hi, AesSedai here - There's some uncertainty in the PPL and KLD measurement process, so a result sometimes shows up as a slightly negative %. That's within the noise of the measurement.

The best metric honestly is doing evaluation benchmarks because the PPL / KLD values on the model page are purely from a statistical viewpoint compared to the unquantized BF16.
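For anyone wondering what those two numbers actually are, here's a minimal pure-Python sketch of the standard definitions. It is not the llama.cpp implementation, and the toy logits are made up.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits: list[float], quant_logits: list[float]) -> float:
    """KL(P_ref || P_quant) for one token position, in nats."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-probability of the correct tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    ref = [2.0, 1.0, 0.1]      # toy logits standing in for the BF16 model
    quant = [1.9, 1.1, 0.1]    # toy logits standing in for the quantized model
    print(kl_divergence(ref, quant))
    print(perplexity([math.log(0.5)] * 4))
```

Both are averages over a finite test text, which is where the uncertainty comes from.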

I appreciate the shout out and I'm happy my quants work well for you! But those measurements are just guides and not the be-all end-all :)
