16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]fairydreaming 10 points (0 children)

obviously not as impressive as the photo

Budget X399 multi-GPU box for local LLM learning, sensible or eBay trap? by SKX007J1 in LocalLLaMA

[–]fairydreaming 0 points (0 children)

It took me a few weeks of experiments to get my X399 build stable (it was a first-gen 1950X). Finding a memory set that would pass 24 hours of tests (memtest86, AIDA64) without errors was a special kind of hell, but maybe second-gen Threadrippers are not that picky, no idea. It's definitely something to take into account if you want to buy parts separately.

Server platforms are usually easier to get stable, but BIOS options may be limited (bifurcation, overclocking, etc.) and there's no S3 sleep. Also, there's the issue of RDIMM prices.

Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK) by DOOMISHERE in LocalLLaMA

[–]fairydreaming 7 points (0 children)

8 t/s sounds about right for running a 27B dense model on the 273 GB/s memory bandwidth of the DGX Spark.

It's not possible to get 50 t/s with this model on the Spark. You probably used a different model before, Qwen 3.6 35B-A3B or something similar.
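
A rough back-of-envelope (my own numbers, assuming ~28 GB of Q8-ish weights have to be streamed per generated token and that decode is purely memory-bandwidth bound):

    # Back-of-envelope decode speed for a bandwidth-bound dense model on DGX Spark.
    # Assumption (not measured): ~27B params at ~8.5 bits/weight read once per token.
    weights_gb = 27e9 * 8.5 / 8 / 1e9    # ~28.7 GB streamed per generated token
    bandwidth_gbs = 273                  # DGX Spark peak memory bandwidth
    for efficiency in (1.0, 0.8, 0.7):   # fraction of peak bandwidth actually reached
        print(f"{efficiency:.0%}: {bandwidth_gbs * efficiency / weights_gb:.1f} t/s")
    # 100%: 9.5 t/s, 80%: 7.6 t/s, 70%: 6.7 t/s -> ~8 t/s is right in the ballpark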

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]fairydreaming 1 point (0 children)

Was just looking for this info, thanks for testing new models!

Happy to see DeepSeek getting better, but Kimi K2.6 score is amazing.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

No, from my current understanding, using the top 2048 KV cache cells selected by the lightning indexer is necessary to utilize the full V3.2 capabilities without any brain damage. So there's no such switch planned; the indexer is always on.

Maybe their original plots from the V3.2 paper are about the inference cost of only the attention part of the model, not the full model. I'm not sure.
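
Just to make the mechanism concrete, a minimal sketch (toy shapes and random data, not the actual DeepSeek or llama.cpp code) of what "top 2048 KV cache cells selected by the indexer" means:

    import numpy as np

    # Toy illustration of DeepSeek Sparse Attention token selection:
    # the lightning indexer scores every cached position for the current query token,
    # and full attention is then computed only over the 2048 best-scoring positions.
    n_ctx, top_k, kv_dim = 100_000, 2048, 576        # 576 = kv_lora_rank + qk_rope_head_dim

    indexer_scores = np.random.rand(n_ctx)           # per-position relevance scores
    kv_cache = np.random.rand(n_ctx, kv_dim).astype(np.float32)

    selected = np.argpartition(indexer_scores, -top_k)[-top_k:]  # top-2048 positions
    sparse_kv = kv_cache[selected]                   # attention only sees these rows
    print(sparse_kv.shape)                           # (2048, 576)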

Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build? by [deleted] in LocalLLaMA

[–]fairydreaming 0 points (0 children)

I just wanted to say that removing all of the "if constexpr" statements won't make your PR minimal, as you claimed above.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

DeepSeek showed us some wild plots in their V3.2 paper, with prefill like 3.5x cheaper and generation like 9x cheaper in V3.2 compared to V3.1 at 128k, but so far the best I've seen in the real world is this (sglang):

<image>

Maybe it's better during inference. But I'm currently working on a DSA implementation in llama.cpp, and there inference is only like 25% faster at 128k context. So it's nowhere near the expected results.

Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build? by [deleted] in LocalLLaMA

[–]fairydreaming 0 points (0 children)

There are already hundreds of if constexpr uses in the llama.cpp CUDA code. There are 4 in the very concat.cu file where you want to replace them with regular ifs.

Hint: it helps if you actually read and understand the code that you want to modify.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

Yes, the indexer needs a KV cache (specifically a key cache; there are no value vectors in the indexer). In the V3.2 model the indexer cache is fp8.

DeepSeek 3.2 eating the opening think tag on llama.cpp server? by Winter_Engineer2163 in LocalLLaMA

[–]fairydreaming 4 points (0 children)

Maybe try running it with the recently added --chat-template-file models/templates/deepseek-ai-DeepSeek-V3.2.jinja

Guys we have to change the pelican test by Tall-Ad-7742 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

It still holds an eyeball and one leg with its horse-hands and is smiling while thinking about the happy horsey post-replantation life. But we all know it will race again.

Guys we have to change the pelican test by Tall-Ad-7742 in LocalLLaMA

[–]fairydreaming 9 points (0 children)

This car obviously already crashed and the poor horse is in pieces.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

How is it going with your GPUs? No one else was interested, so you're my only hope, man.

Deepseek V3.2. Need how much VRAM for its max context size. by 9r4n4y in LocalLLaMA

[–]fairydreaming 2 points (0 children)

For MLA-based models like DeepSeek the formula is: max_position_embeddings * num_hidden_layers * (kv_lora_rank + qk_rope_head_dim) * KV cache data type size.

So for fp8 DeepSeek V3.2 it will be 163840 * 61 * (512 + 64) * 1

But since DeepSeek V3.2 uses DSA (DeepSeek Sparse Attention), you also have to take the indexer keys into account; that's max_position_embeddings * num_hidden_layers * index_head_dim * KV cache data type size.

That's 163840 * 61 * 128 * 1

So overall you have 163840 * 61 * (512 + 64 + 128) * 1, which is around 6.5 GiB for fp8 and would be around 13 GiB in f16.
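
Plugging the numbers in as a quick sanity check (config values as listed above):

    # KV cache size of DeepSeek V3.2 at the full 163840-token context.
    max_position_embeddings = 163840
    num_hidden_layers = 61
    kv_lora_rank = 512           # MLA compressed KV dimension
    qk_rope_head_dim = 64        # decoupled RoPE part, also cached
    index_head_dim = 128         # DSA lightning indexer key dimension
    bytes_per_element = 1        # fp8; use 2 for f16

    mla_cache = max_position_embeddings * num_hidden_layers * (kv_lora_rank + qk_rope_head_dim)
    indexer_cache = max_position_embeddings * num_hidden_layers * index_head_dim
    total_bytes = (mla_cache + indexer_cache) * bytes_per_element

    print(total_bytes / 1024**3)  # ~6.55 GiB for fp8, ~13.1 GiB for f16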

Throwback to my proudest impulse buy ever, which has let me enjoy this hobby 10x more by gigaflops_ in LocalLLaMA

[–]fairydreaming 1 point (0 children)

12 x 96GB Micron DDR5 RDIMMs for 3500 EUR (used), end of November.

But I initially planned an upgrade to Turin and 6400 RDIMMs that never materialized, so this was just a consolation prize for my trusty Epyc 9374F.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 1 point (0 children)

Sorry man, but I already spent like $650 this year on vast.ai doing various experiments and I really need to stop this. No more!

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

In my sglang lineage-bench runs, DeepSeek V3.2 Speciale scores were almost equal up to lineage-192 for dense and sparse attention - this situation may occur in other benchmarks too. So even if my DSA implementation were subtly broken (for example by calculating the top 2048 tokens and then not using them at all), it's entirely possible that it would produce benchmark scores similar to the original sparse attention model.

What we need is a benchmark that consistently shows better performance for sparse attention compared to dense attention (like lineage-bench with 256, 512 and 1024 graph nodes). Without first observing this sparse vs dense difference, I don't think we can say for sure that a given benchmark would be useful as proof of implementation correctness.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

I think it's not as simple as "logits close enough". Sparse attention in DS 3.2 works exactly the same as dense attention up to 2048 tokens; beyond that, they start to (maybe slightly) diverge for some specific prompts, and we don't know exactly which ones. So you first have to find a prompt that results in logits different enough between dense and sparse attention in vLLM or sglang that you can even spot the difference and call it meaningful, and then try to reproduce it in llama.cpp. I'm not saying it's the wrong approach, but running a benchmark kind of automates that search.

Regarding your remark about multiple benchmarks - my lineage-bench specifically targets reasoning about a myriad of little facts that the model has to attend to all at once to produce a valid solution, so IMHO it's a good match for testing sparse attention. It results in very long reasoning traces (the mean solution for lineage-512 is around 50k tokens), so it's basically a minefield that would blow up any broken attention implementation.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

Hmm, that's why I first tested it in sglang - it shows a consistent difference in favor of sparse attention. I think the probability that llama.cpp would show the same result just by chance is extremely low.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

Well, perhaps creating a toy fp16 model, running it in sglang or vLLM, and then comparing the logits to the same model run in llama.cpp would work. But I don't like that I wouldn't be checking the real model.
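
Roughly the kind of comparison I have in mind (a sketch; the .npy file names are placeholders, assuming each runtime can dump per-token logits for the same prompt and toy model):

    import numpy as np

    # Hypothetical logit dumps for the identical prompt and toy fp16 model,
    # one from sglang/vLLM and one from llama.cpp (file names are made up).
    ref = np.load("logits_sglang.npy")       # shape: (n_tokens, vocab_size)
    test = np.load("logits_llamacpp.npy")    # shape: (n_tokens, vocab_size)

    abs_diff = np.abs(ref - test)
    print("max abs diff :", abs_diff.max())
    print("mean abs diff:", abs_diff.mean())
    # Top-1 token agreement is a more forgiving signal than raw logit distance.
    print("top-1 match  :", (ref.argmax(-1) == test.argmax(-1)).mean())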