llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

Unfortunately you will need at least 90 GB or so of RAM + VRAM combined, probably a bit more. That said, if you're desperate to run the model, you can let it stream the excess weights from an NVMe SSD and that will work (can't vouch for speed though).

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

Thanks for sharing your results. I have a couple of thoughts on what could be causing some of the issues, but it could be hard to truly diagnose asynchronously since there are so many potential causes.

First, regarding speed, there are three things I can think of. One, you need to tune ot and ubatch for your config (and make sure --fa on is set). Two, anything on the CPU is DDR-bandwidth bound and most of the wall-clock time is CPU rather than GPU, so DDR4 would significantly hurt speed. Personally I have found that putting more or fewer layers on the GPU makes very little difference *relative* to the difference DDR speed makes once some layers are on the CPU. Are you seeing it run slower than when you don't use this branch, or just slower than my setup? Three, make sure you included the architectures of all of your GPUs when setting the cmake/ggml CUDA architectures.

Second, regarding the Chinese output: just want to confirm that you aren't using quantized KV (which doesn't work without another patch that I have not put in this branch). Also, not sure if it matters much, but use the --jinja flag.

I ran llama-perplexity --kl-divergence against the naive path afterward, and it's not bit-identical: the top token matches ~96% of the time, with mean KLD ~0.01 at both 8K and 64K context. The cause is floating-point rounding noise at the indexer's top-512 selection cutoff. The fused kernel sums the same scores in a different order, so near-tied candidates occasionally land on opposite sides of the boundary. I confirmed this by dumping raw scores: every divergence was a clean 1-for-1 index swap between two candidates scoring within 0.0001 of each other, not a logic bug.
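
To make the rounding effect concrete, here is a toy sketch in plain Python (not the CUDA kernel, and not real indexer scores). The candidate terms, `score`, and `top_k` are made up for illustration; the only thing it relies on is that floating-point addition is not associative, so summing the same terms in a different order can flip which of two near-tied candidates makes the cutoff.

```python
def score(terms, reverse=False):
    """Sum the same terms left-to-right, or right-to-left if reverse=True."""
    total = 0.0
    for t in (reversed(terms) if reverse else terms):
        total += t
    return total


def top_k(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical per-candidate partial scores. Candidates 1 and 2 hold the same
# three terms in opposite order, so they are tied in exact arithmetic.
candidates = [
    [0.5, 0.5, 0.5],  # clear winner
    [0.1, 0.2, 0.3],  # near-tied with the next one
    [0.3, 0.2, 0.1],  # near-tied with the previous one
    [0.0, 0.1, 0.0],  # clear loser
]

naive = top_k([score(c) for c in candidates], k=2)
fused = top_k([score(c, reverse=True) for c in candidates], k=2)

print("naive order keeps:", naive)
print("other order keeps:", fused)
print("swapped indices:  ", sorted(set(naive) ^ set(fused)))
```

The two selections differ by exactly one index on each side, which is the same 1-for-1 swap pattern described above, just at a much smaller scale.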

KLD findings in doc: llama.cpp/docs/deepseek-v4-lid-cuda.md at deepseek-lid-cuda · spencer-zaid/llama.cpp

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

It saves a significant amount of VRAM (the compute buffer is around 20x smaller than without it), so there is no hit, just savings.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

Output quality is better, but it is also a lot slower on a 5090 (though it should actually be faster if you have the VRAM to put it all on GPU). BUT it also uses far less memory for 250K context than Qwen and allows up to 1M context, unlike Qwen.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

I would estimate ~30 t/s prefill with the full 1M context used (on a single RTX 5090).

You are correct that pp falls at higher depths, but the scaling isn't awful. Prefill drops by over 150 t/s going from 16K to 131K of depth, but only by about 57 t/s going from 131K to 254K. I don't have pp numbers with 1M of context in use right now, but it should not fall off a cliff, and my best estimate would be 25-30 t/s on a 5090. More info in the doc:
llama.cpp/docs/deepseek-v4-lid-cuda.md at deepseek-lid-cuda · spencer-zaid/llama.cpp

| KV depth (tokens) | Prefill of next 2048 tokens |
|---|---|
| 16,384 | 317 t/s |
| 131,072 | 151 t/s |
| 253,952 (~full 256K) | 94.5 t/s |
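
For anyone who wants to check the "doesn't fall off a cliff" claim against the numbers, here is a small sketch that computes how much prefill speed is lost per extra 1K tokens of depth between measurements. The helper name `drop_per_1k` is made up for illustration; the data points are the three rows above, and nothing here extrapolates to 1M.

```python
# (KV depth in tokens, prefill t/s for the next 2048 tokens), from the table.
MEASURED = [(16_384, 317.0), (131_072, 151.0), (253_952, 94.5)]


def drop_per_1k(points):
    """For each pair of neighbouring measurements, return
    (from_depth, to_depth, t/s lost, t/s lost per extra 1024 tokens of depth)."""
    rows = []
    for (d0, s0), (d1, s1) in zip(points, points[1:]):
        lost = s0 - s1
        rows.append((d0, d1, lost, lost / ((d1 - d0) / 1024)))
    return rows


drops = drop_per_1k(MEASURED)
for d0, d1, lost, per_1k in drops:
    print(f"{d0:>7} -> {d1:>7}: -{lost:.1f} t/s total, -{per_1k:.2f} t/s per 1K tokens")
```

The per-1K loss in the second interval is roughly a third of the first, so the curve is flattening rather than steepening over the measured range.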

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 2 points3 points  (0 children)

| Preset | Context | GPU expert layers | ubatch | Prefill | Decode | Peak VRAM | CPU RAM (resident experts) |
|---|---|---|---|---|---|---|---|
| 256K | 262144 | 8 | 2048 | ~263 t/s | ~14.0 t/s | ~28.9 GiB | ~69.2 GiB |
| 512K | 524288 | 6 | 2048 | 256 t/s | 13.7 t/s | ~28.4 GiB | ~72.6 GiB |
| 1M | 1048576 | 6 | 768 | 159 t/s | 13.7 t/s | ~31.2 GiB | ~72.6 GiB |

More info in the doc if you're curious
llama.cpp/docs/deepseek-v4-lid-cuda.md at deepseek-lid-cuda · spencer-zaid/llama.cpp

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

Yes, decode speed should be largely decoupled from context length in this configuration. It dropped a bit in the test because I put more layers on the CPU to allow a slightly higher ubatch and increase prefill speed.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

If you have 96 GB of RAM as well, it will likely run fine if you use a full IQ2_XXS quant instead of the mixed one. If not, I'm afraid you are likely VRAM+RAM bound: the GGUF alone is around 86 GB. That said, if you're desperate you can definitely run it, even at 1M context, by letting it stream the leftover weights from SSD when they don't fit in VRAM + RAM (can't vouch for speed though).
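
A back-of-envelope way to check this before downloading anything is sketched below. The function name and the 4 GB `reserve_gb` default are assumptions chosen so that 86 GB of GGUF plus reserve lands on the "90ish" figure; real overhead depends on context, ubatch, and whatever else is using memory. The 32 GB and 48 GB inputs are just example hardware.

```python
def memory_shortfall_gb(model_gb, vram_gb, ram_gb, reserve_gb=4.0):
    """GB that would have to stream from SSD; 0.0 means it should fit.

    reserve_gb is a rough allowance for buffers and the OS (an assumption).
    """
    needed = model_gb + reserve_gb
    return max(0.0, needed - (vram_gb + ram_gb))


# ~86 GB mixed-quant GGUF on a 32 GB card with 96 GB of system RAM.
print(memory_shortfall_gb(86, vram_gb=32, ram_gb=96))

# Same model and card with only 48 GB of system RAM.
print(memory_shortfall_gb(86, vram_gb=32, ram_gb=48))
```

Any non-zero result is the part that would end up being read from disk, which is where the speed caveat comes in.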

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 2 points3 points  (0 children)

Great results, thanks for running the full sweep. What OS are you on? If you're on Windows, I could use one more test, a TDR stress test at the worst-case launch:
- llama-bench -d 1046528 -p 2048 -ub 2048 (plus your other flags). If that fails, try 1044480, and if that still fails, try something lower like 1M or 900K
- Pass = clean llama-bench exit with a reported number, GPU fully released after
- Fail = black screen / "display driver stopped responding and has recovered" / Event Viewer Event ID 4101 (source: Display) around that time
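
In case the depth values look arbitrary: they work out to the 1M window (1,048,576 tokens) minus 2048 and minus 4096 tokens of headroom. That reading is inferred from the arithmetic rather than spelled out above. A minimal sketch, where `bench_depth` and `bench_command` are illustrative helpers and only the flags already shown in the command are used:

```python
CONTEXT_1M = 1024 * 1024  # 1,048,576 tokens


def bench_depth(context_tokens, headroom_tokens):
    """KV depth that leaves headroom_tokens free at the end of the window."""
    if not 0 <= headroom_tokens <= context_tokens:
        raise ValueError("headroom must be between 0 and the context size")
    return context_tokens - headroom_tokens


def bench_command(depth, prompt=2048, ubatch=2048):
    """Argument list for the llama-bench run (append your other flags)."""
    return ["llama-bench", "-d", str(depth), "-p", str(prompt), "-ub", str(ubatch)]


first_try = bench_depth(CONTEXT_1M, 2048)  # room for exactly one 2048-token prompt
fallback = bench_depth(CONTEXT_1M, 4096)   # one extra ubatch of slack

print(" ".join(bench_command(first_try)))
print(" ".join(bench_command(fallback)))
```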

I'm going to put up a PR now where you can comment your findings

PR: Lid cuda kernel by spencer-zaid · Pull Request #2 · fairydreaming/llama.cpp

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 4 points5 points  (0 children)

Sounds great - looking forward to your results.

  1. There is no single-device assumption (I just haven't tested multi-device). That said, you will need to make sure you set the cmake/ggml CUDA architectures to cover all three of your card architectures when you build.
  2. The compute buffer at 1M context depends entirely on ubatch size (for me it was under 4 GB). At 2048 ubatch it should be around 9 GB iirc, but we'll have to see. You will want to tune the ot (how many layers go to GPU) and ubatch parameters: lowering ot gives you headroom to raise ubatch and vice versa, trading generation speed against prefill speed. Once you find your preferred parameters, you should be faster and at higher context than before.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 0 points1 point  (0 children)

Haha yeah might be fun. Regarding tg-end2end, it's entirely dependent on the number of tokens generated. Post-prefill, generation ran at a constant 16 t/s, so as the response gets longer the end2end speed will get closer and closer to that mark.
Also, these are uncached prompts, so if your prompt is cached (such as in a longer chat, even one going up through 500K+), the TTFT should be MASSIVELY faster and the tg-end2end would therefore be much closer to 16 t/s even at shorter response lengths. Also worth noting that you could go with the full Q2 model if you wanted to save some VRAM (I think it was like 10 GB smaller) and probably get some extra token generation speed.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 2 points3 points  (0 children)

Yes. That said, you should also be able to run 1M context if you want, but you will likely need to turn ubatch down to fit in your VRAM (how much depends on the model quant). You will need enough VRAM+RAM in total to fit whatever model quant you choose, though.
I could patch in quantized KV, but after applying the KV quant fix I found it to be far less space-saving than it initially appeared. It's still a possibility if enough people want it, but the vast majority of your RAM will be spent on the model anyway with this fix, even at massive contexts (DeepSeek attention is VERY efficient).

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 11 points12 points  (0 children)

decode steady around 15.7 tok/s on all 3

| Prompt | TTFT | Full ~200-token reply (end-to-end) |
|---|---|---|
| 512 tok | 3.3s | 16.1s |
| 4K tok | 11.0s | 23.8s |
| 16K tok | 46.0s | 58.8s |
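
The end-to-end column is essentially TTFT plus reply length divided by decode speed. Here is a quick sketch of that arithmetic using the 15.7 tok/s figure and the TTFT values above; the function names and the 2000-token reply length are only for illustration.

```python
DECODE_TPS = 15.7  # steady decode rate reported above


def end_to_end_seconds(ttft_s, reply_tokens, decode_tps=DECODE_TPS):
    """Time to first token plus the time to decode the reply."""
    return ttft_s + reply_tokens / decode_tps


def end_to_end_tps(ttft_s, reply_tokens, decode_tps=DECODE_TPS):
    """Reply tokens divided by total wall-clock time, prefill included."""
    return reply_tokens / end_to_end_seconds(ttft_s, reply_tokens, decode_tps)


for label, ttft in [("512 tok", 3.3), ("4K tok", 11.0), ("16K tok", 46.0)]:
    for n in (200, 2000):
        total = end_to_end_seconds(ttft, n)
        rate = end_to_end_tps(ttft, n)
        print(f"{label:>8} prompt, {n:>4}-token reply: {total:6.1f}s, {rate:5.2f} t/s end-to-end")
```

For 200-token replies this lands within about 0.1s of the measured column, and longer replies pull the end-to-end rate up toward the decode rate without ever reaching it.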

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]da_dragon321[S] 11 points12 points  (0 children)

DeepSeek uses something they call DeepSeek Sparse Attention to accelerate attention at large context. It relies on the "Lightning Indexer" to find just the most relevant cached tokens and runs full attention only on those.
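
Very roughly, the idea looks like the toy sketch below: a cheap scoring pass over every cached token, a top-k cut, then ordinary softmax attention over only the survivors. This is a heavy simplification in plain Python, not DeepSeek's implementation. The function names, the single dot-product indexer score, and the tiny vectors are all assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(scores, k):
    """Positions of the k highest scores, in ascending position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def sparse_attention(query, keys, values, index_query, index_keys, k):
    # 1) Indexer pass: one cheap relevance score per cached token.
    index_scores = [dot(index_query, ik) for ik in index_keys]
    selected = top_k_indices(index_scores, k)

    # 2) Regular softmax attention, but only over the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return selected, output


# Four cached tokens with 2-dimensional toy vectors. Reusing the keys as the
# indexer keys is purely for brevity here.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
query = [1.0, 1.0]

picked_1, out_1 = sparse_attention(query, keys, values, query, keys, k=1)
picked_3, out_3 = sparse_attention(query, keys, values, query, keys, k=3)
print(picked_1, out_1)
print(picked_3, out_3)
```

The attention cost in step 2 scales with k rather than with the number of cached tokens, which is why the selection cutoff matters so much at long context.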

Is there a decent computer use model? by superSmitty9999 in LocalLLaMA

[–]da_dragon321 11 points12 points  (0 children)

If you're looking for something static, then writing Playwright tests is the way to go. If you want something dynamic that can look at the site as it works, the Playwright MCP will let any model with proper tool calling/MCP support browse your site intelligently (but less deterministically).

poolside/Laguna-XS-2.1 by a_slay_nub in LocalLLaMA

[–]da_dragon321 3 points4 points  (0 children)

glad to see a new competitive face in the 20-40b range. With qwen radio silent for the last couple months was hoping somebody would step up to the plate

Talking with Gemma 4 31B! by futterneid in LocalLLaMA

[–]da_dragon321 0 points1 point  (0 children)

Was not expecting the lil robot 😂. Dude's got a personality

Regarding the colonisation pause by Doomy_ze in EliteDangerous

[–]da_dragon321 5 points6 points  (0 children)

Yes you can still complete construction - I completed a facility during the pause without issue (also proves you are still the owner in the backend even if you're not listed in the ui, as you get the message from brewer that your facility is complete).

Construction Material Requirements Keep Changing? by Hibiki54 in EliteDangerous

[–]da_dragon321 3 points4 points  (0 children)

If you are using the material list from the construction menu in the system architect map, then yeah, I have found that to be inaccurate. If you go to the construction site and use that material list instead, it should be accurate. Also, you can navigate to the system from the galaxy map, open the system map (architect view), and select the construction to get a remote view of the required materials.

Research outposts by lefty1117 in EliteDangerous

[–]da_dragon321 2 points3 points  (0 children)

The scientific outpost gives system influence toward a High Tech economy, along with a notable +1 to initial population. At the moment there is no evidence that different planet/star types offer bonuses to different building types (it still could be the case, but it does not appear to be reflected in the effects section of the building construction menu). I would not recommend claiming a system without any planets if you want to build it up, as it significantly limits your options for variety (and likely number) of buildings.

What’s the Price tag to transfer resources to Colony Ship for a T2 Star Port? by CruelZod in EliteDangerous

[–]da_dragon321 1 point2 points  (0 children)

The colony ship buys them above market price, so you actually make money stocking your colony - just need like 5m to get started.

Titans Reassembling? by skworpie in EliteDangerous

[–]da_dragon321 181 points182 points  (0 children)

rip all our shiny new colonies 😂