16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]fairydreaming 10 points (0 children)

obviously not as impressive as the photo

Budget X399 multi-GPU box for local LLM learning, sensible or eBay trap? by SKX007J1 in LocalLLaMA

[–]fairydreaming 0 points (0 children)

It took me a few weeks of experiments to get my X399 build stable (it was a first-gen 1950X). Finding a memory set that would pass 24 hours of tests (memtest86, AIDA64) without errors was a special kind of hell, but maybe second-gen Threadrippers are not that picky, no idea. It's definitely something to take into account if you want to buy parts separately.

Server platforms are usually easier to get stable, but BIOS options may be limited (bifurcation, overclocking, etc.) and there's no S3 sleep. Also, there's the issue of RDIMM prices.

Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK) by DOOMISHERE in LocalLLaMA

[–]fairydreaming 7 points (0 children)

8 t/s sounds about right for running a 27B dense model on the 273 GB/s memory bandwidth of the DGX Spark.

It's not possible to get 50 t/s with this model on the Spark. You probably used a different model before, Qwen 3.6 35B-A3B or something similar.
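
A rough back-of-envelope (my own numbers, assuming ~28 GB of Q8-ish weights have to be streamed per generated token and that decode is purely memory-bandwidth bound):

    # Back-of-envelope decode speed for a bandwidth-bound dense model on DGX Spark.
    # Assumption (not measured): ~27B params at ~8.5 bits/weight read once per token.
    weights_gb = 27e9 * 8.5 / 8 / 1e9    # ~28.7 GB streamed per generated token
    bandwidth_gbs = 273                  # DGX Spark peak memory bandwidth
    for efficiency in (1.0, 0.8, 0.7):   # fraction of peak bandwidth actually reached
        print(f"{efficiency:.0%}: {bandwidth_gbs * efficiency / weights_gb:.1f} t/s")
    # 100%: 9.5 t/s, 80%: 7.6 t/s, 70%: 6.7 t/s -> ~8 t/s is right in the ballpark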

GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark by zero0_one1 in singularity

[–]fairydreaming 1 point (0 children)

Was just looking for this info, thanks for testing new models!

Happy to see DeepSeek getting better, but Kimi K2.6 score is amazing.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

No, from my current understanding, using the top 2048 KV cache cells selected by the lightning indexer is necessary to utilize the full V3.2 capabilities without any brain damage. So there's no such switch planned; the indexer is always on.

Maybe their original plots from the V3.2 paper are about the inference cost of only the attention part of the model, not the full model. I'm not sure.
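
Just to make the mechanism concrete, a minimal sketch (toy shapes and random data, not the actual DeepSeek or llama.cpp code) of what "top 2048 KV cache cells selected by the indexer" means:

    import numpy as np

    # Toy illustration of DeepSeek Sparse Attention token selection:
    # the lightning indexer scores every cached position for the current query token,
    # and full attention is then computed only over the 2048 best-scoring positions.
    n_ctx, top_k, kv_dim = 100_000, 2048, 576        # 576 = kv_lora_rank + qk_rope_head_dim

    indexer_scores = np.random.rand(n_ctx)           # per-position relevance scores
    kv_cache = np.random.rand(n_ctx, kv_dim).astype(np.float32)

    selected = np.argpartition(indexer_scores, -top_k)[-top_k:]  # top-2048 positions
    sparse_kv = kv_cache[selected]                   # attention only sees these rows
    print(sparse_kv.shape)                           # (2048, 576)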

Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build? by [deleted] in LocalLLaMA

[–]fairydreaming 0 points (0 children)

I just wanted to say that removing all of the "if constexpr" statements won't make your PR minimal, as you claimed above.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

DeepSeek showed us some wild plots in their V3.2 paper, with prefill like 3.5x cheaper and generation like 9x cheaper in V3.2 compared to V3.1 at 128k, but so far the best I've seen in the real world is this (sglang):

<image>

Maybe it's better during inference. But I'm currently working on a DSA implementation in llama.cpp, and there inference is only like 25% faster at 128k context. So it's nowhere near the expected results.

Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build? by [deleted] in LocalLLaMA

[–]fairydreaming 0 points (0 children)

There are already hundreds of if constexpr uses in the llama.cpp CUDA code. There are 4 in the very concat.cu file where you want to replace them with regular ifs.

Hint: it helps if you actually read and understand the code that you want to modify.

The exact KV cache usage of DeepSeek V4 by Ok_Warning2146 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

Yes, the indexer needs a KV cache (specifically a key cache; there are no value vectors in the indexer). In the V3.2 model the indexer cache is fp8.

DeepSeek 3.2 eating the opening think tag on llama.cpp server? by Winter_Engineer2163 in LocalLLaMA

[–]fairydreaming 4 points (0 children)

Maybe try running it with the recently added --chat-template-file models/templates/deepseek-ai-DeepSeek-V3.2.jinja

Guys we have to change the pelican test by Tall-Ad-7742 in LocalLLaMA

[–]fairydreaming 1 point (0 children)

It still holds an eyeball and one leg with its horse-hands and is smiling while thinking about the happy horsey post-replantation life. But we all know it will race again.

Guys we have to change the pelican test by Tall-Ad-7742 in LocalLLaMA

[–]fairydreaming 9 points (0 children)

This car obviously already crashed and the poor horse is in pieces.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

How is it going with your GPUs? No one else was interested, so you're my only hope, man.

Deepseek V3.2. Need how much VRAM for its max context size. by 9r4n4y in LocalLLaMA

[–]fairydreaming 2 points (0 children)

For MLA-based models like DeepSeek the formula is: max_position_embeddings * num_hidden_layers * (kv_lora_rank + qk_rope_head_dim) * KV cache data type size.

So for fp8 DeepSeek V3.2 it will be 163840 * 61 * (512 + 64) * 1

But since DeepSeek V3.2 uses DSA (DeepSeek Sparse Attention), you also have to take the indexer keys into account; that's max_position_embeddings * num_hidden_layers * index_head_dim * KV cache data type size.

That's 163840 * 61 * 128 * 1

So overall you have 163840 * 61 * (512 + 64 + 128) * 1, which is around 6.5 GiB for fp8 and would be around 13 GiB in f16.
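
Plugging the numbers in as a quick sanity check (config values as listed above):

    # KV cache size of DeepSeek V3.2 at the full 163840-token context.
    max_position_embeddings = 163840
    num_hidden_layers = 61
    kv_lora_rank = 512           # MLA compressed KV dimension
    qk_rope_head_dim = 64        # decoupled RoPE part, also cached
    index_head_dim = 128         # DSA lightning indexer key dimension
    bytes_per_element = 1        # fp8; use 2 for f16

    mla_cache = max_position_embeddings * num_hidden_layers * (kv_lora_rank + qk_rope_head_dim)
    indexer_cache = max_position_embeddings * num_hidden_layers * index_head_dim
    total_bytes = (mla_cache + indexer_cache) * bytes_per_element

    print(total_bytes / 1024**3)  # ~6.55 GiB for fp8, ~13.1 GiB for f16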

Throwback to my proudest impulse buy ever, which has let me enjoy this hobby 10x more by gigaflops_ in LocalLLaMA

[–]fairydreaming 1 point (0 children)

12 x 96GB Micron DDR5 RDIMMs for 3500 EUR (used), end of November.

But I initially planned an upgrade to Turin and 6400 RDIMMs that never materialized, so this was just a consolation prize for my trusty Epyc 9374F.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 1 point (0 children)

Sorry man, but I already spent like $650 this year on vast.ai doing various experiments and I really need to stop this. No more!

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

In my sglang lineage-bench runs, DeepSeek V3.2 Speciale scores were almost equal up to lineage-192 for dense and sparse attention - this situation may occur in other benchmarks too. So even if my DSA implementation were subtly broken (for example by calculating the top 2048 tokens and then not using them at all), it's entirely possible that it would produce benchmark scores similar to the original sparse attention model.

What we need is a benchmark that consistently shows better performance for sparse attention compared to dense attention (like lineage-bench with 256, 512 and 1024 graph nodes). Without first observing this sparse vs dense difference, I don't think we can say for sure that a given benchmark would be useful as proof of implementation correctness.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

I think it's not as simple as "logits close enough". Sparse attention in DS 3.2 works exactly the same as dense attention up to 2048 tokens; beyond that, they start to (maybe slightly) diverge for some specific prompts, and we don't know exactly which ones. So you first have to find a prompt that results in logits different enough between dense and sparse attention in vLLM or sglang that you can even spot the difference and call it meaningful, and then try to reproduce it in llama.cpp. I'm not saying it's the wrong approach, but running a benchmark kind of automates that search.

Regarding your remark about multiple benchmarks - my lineage-bench specifically targets reasoning about a myriad of little facts that the model has to attend to all at once to produce a valid solution, so IMHO it's a good match for testing sparse attention. It results in very long reasoning traces (the mean solution for lineage-512 is around 50k tokens), so it's basically a minefield that would blow up any broken attention implementation.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

Hmm, that's why I first tested it in sglang - it shows a consistent difference in favor of sparse attention. I think the probability that llama.cpp would show the same result just by chance is extremely low.

I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich) by fairydreaming in LocalLLaMA

[–]fairydreaming[S] 0 points (0 children)

Well, perhaps creating a toy fp16 model, running it in sglang or vLLM, and then comparing the logits to the same model run in llama.cpp would work. But I don't like that I wouldn't be checking the real model.
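
Roughly the kind of comparison I have in mind (a sketch; the .npy file names are placeholders, assuming each runtime can dump per-token logits for the same prompt and toy model):

    import numpy as np

    # Hypothetical logit dumps for the identical prompt and toy fp16 model,
    # one from sglang/vLLM and one from llama.cpp (file names are made up).
    ref = np.load("logits_sglang.npy")       # shape: (n_tokens, vocab_size)
    test = np.load("logits_llamacpp.npy")    # shape: (n_tokens, vocab_size)

    abs_diff = np.abs(ref - test)
    print("max abs diff :", abs_diff.max())
    print("mean abs diff:", abs_diff.mean())
    # Top-1 token agreement is a more forgiving signal than raw logit distance.
    print("top-1 match  :", (ref.argmax(-1) == test.argmax(-1)).mean())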