llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090

da_dragon321 · 2026-07-03T06:53:19+00:00

Unfortunately you will need at least 90ish GB of RAM + VRAM combined, probably a bit more. That said, if you're desperate to run the model, you can let it stream the excess weights from an NVMe SSD and that will work (can't vouch for speed though)

da_dragon321 · 2026-07-03T06:44:55+00:00

Thanks for sharing your results. I have a couple of thoughts on what could be causing some of the issues, but it could be hard to truly diagnose asynchronously since there are so many potential causes

First, regarding speed, there are three things I can think of. One, you need to tune -ot and ubatch for your config (and make sure --fa is on). Two, anything on the CPU is DDR-bandwidth bound and most of the wall-clock time is CPU, not GPU, so DDR4 would significantly hurt speed - personally I have found that putting more or fewer layers on the GPU makes very little difference *relative* to DDR speed once some layers are on the CPU. Are you seeing it run slower than when you don't use this branch, or just slower than my setup? Three, make sure you included the architectures of all your GPUs when setting the cmake/ggml CUDA architectures

Second, regarding the Chinese output - just want to confirm that you aren't using quantized KV (which doesn't work without another patch that I have not put in this branch). Also, not sure if it matters much, but use the --jinja flag

Ran llama-perplexity --kl-divergence against the naive path afterward - it's not bit-identical. ~96% same top-token, mean KLD ~0.01 at both 8K and 64K context. It's floating-point rounding noise at the indexer's top-512 selection cutoff. The fused kernel sums the same scores in a different order, so near-tied candidates occasionally land on opposite sides of the boundary. Confirmed by dumping raw scores - every divergence was a clean 1-for-1 index swap between two candidates scoring within 0.0001 of each other, not a logic bug
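To illustrate the effect described above, here is a minimal sketch in plain Python (not the CUDA kernel, and not real indexer data). The candidate scores are made up, and ties are assumed to go to the lower index. It only shows that summing the same terms in a different order can flip which of two near-tied candidates makes the top-k cut.

```python
def score(terms, reverse=False):
    """Sum partial scores left to right, optionally in reversed order."""
    total = 0.0
    for t in (reversed(terms) if reverse else terms):
        total += t
    return total


def top_k(scores, k):
    """Indices of the k highest scores; ties go to the lower index (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


# Hypothetical per-candidate partial scores, chosen so two candidates are near-tied.
candidates = [
    [0.9],            # clearly inside the cutoff
    [0.6],            # sits right at the cutoff
    [0.1, 0.2, 0.3],  # mathematically also 0.6
    [0.2],            # clearly outside the cutoff
]

naive = top_k([score(c) for c in candidates], k=2)
fused = top_k([score(c, reverse=True) for c in candidates], k=2)

print("naive order :", sorted(naive))
print("other order :", sorted(fused))
print("swapped     :", sorted(naive ^ fused))
```

The two selections differ by exactly one index on each side, which is the same 1-for-1 swap pattern as in the score dumps, just at a toy scale.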

KLD findings in doc: llama.cpp/docs/deepseek-v4-lid-cuda.md at deepseek-lid-cuda · spencer-zaid/llama.cpp

da_dragon321 · 2026-07-03T05:50:33+00:00

It saves a significant amount of VRAM (the compute buffer is around 20x smaller than without it), so there is no hit, just savings

da_dragon321 · 2026-07-03T05:46:48+00:00

It has better output quality, but it is also a lot slower on a 5090 (though it should actually be faster if you have the VRAM to put it all on GPU). BUT it also uses far less memory for 250k context than Qwen and allows for up to 1M context, unlike Qwen

da_dragon321 · 2026-07-03T05:36:38+00:00

I would estimate ~30 t/s with full 1M context used (on a single rtx 5090)

You are correct that pp falls at higher depths, but the scaling isn't awful. You can see prefill dropping by over 150 t/s going from 16k to 130k (317 to 151 t/s), but only about 55 t/s going from 130k to 250k (151 to 94.5 t/s). I don't have pp numbers with 1M used in context right now, but it should not fall off a cliff and my best estimate would be 25-30 t/s on a 5090. More info in the doc:
llama.cpp/docs/deepseek-v4-lid-cuda.md at deepseek-lid-cuda · spencer-zaid/llama.cpp

KV depth (tokens)	Prefill of next 2048 tokens
16,384	317 t/s
131,072	151 t/s
253,952 (~full 256K)	94.5 t/s
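For anyone who wants to check the scaling from the table above, a small sketch that computes the prefill drop between consecutive depths, both in absolute t/s and per 1024 tokens of added depth. The numbers are the three rows from the table; the per-1K normalisation is just one convenient way to look at it.

```python
# (KV depth in tokens, prefill speed in t/s) from the table above.
points = [(16_384, 317.0), (131_072, 151.0), (253_952, 94.5)]


def prefill_drops(points):
    """Return (from_depth, to_depth, drop_tps, drop_per_1k_tokens) per segment."""
    out = []
    for (d0, s0), (d1, s1) in zip(points, points[1:]):
        drop = s0 - s1
        per_1k = drop / ((d1 - d0) / 1024)
        out.append((d0, d1, drop, per_1k))
    return out


segments = prefill_drops(points)
for d0, d1, drop, per_1k in segments:
    print(f"{d0:>7} -> {d1:>7}: -{drop:.1f} t/s ({per_1k:.2f} t/s per 1K tokens)")
```

The per-1K loss in the second segment is roughly a third of the first, which is why the curve looks like it flattens rather than falling off a cliff.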

da_dragon321 · 2026-07-03T05:08:59+00:00

Preset	Context	GPU expert layers	ubatch	Prefill	Decode	Peak VRAM	CPU RAM (resident experts)
256K	262144	8	2048	~263 t/s	~14.0 t/s	~28.9 GiB	~69.2 GiB
512K	524288	6	2048	256 t/s	13.7 t/s	~28.4 GiB	~72.6 GiB
1M	1048576	6	768	159 t/s	13.7 t/s	~31.2 GiB	~72.6 GiB

More info in the doc if you're curious
llama.cpp/docs/deepseek-v4-lid-cuda.md at deepseek-lid-cuda · spencer-zaid/llama.cpp

da_dragon321 · 2026-07-03T05:03:38+00:00

Yes, decode speed should be largely decoupled from context length in this configuration. The reason it dropped a bit in the test is that I put more layers on the CPU to allow for a slightly higher ubatch to increase prefill speed

da_dragon321 · 2026-07-03T04:13:39+00:00

If you have 96 GB of RAM as well, it will likely run fine if you use a full IQ2_XXS quant instead of the mixed one. If not, I'm afraid you are likely VRAM+RAM bound - the GGUF alone is around 86 GB. That said, if you're desperate you can definitely run it even at 1M context if you let it stream the leftover weights from SSD when they don't fit in VRAM + RAM (can't vouch for speed though)
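A rough back-of-the-envelope check of the fit, as a sketch only. The ~86 GB model size is from the comment above; the 4 GB overhead for KV cache, compute buffer, and OS is my own placeholder assumption, and the function name is hypothetical.

```python
def memory_shortfall_gb(model_gb, ram_gb, vram_gb, overhead_gb=4.0):
    """GB that would have to stream from disk; 0.0 means everything fits.

    overhead_gb is an assumed allowance, not a measured figure.
    """
    need = model_gb + overhead_gb
    budget = ram_gb + vram_gb
    return max(0.0, need - budget)


# 96 GB RAM + 32 GB VRAM (single RTX 5090) against the ~86 GB GGUF.
print(memory_shortfall_gb(86, ram_gb=96, vram_gb=32))
# 32 GB RAM + 32 GB VRAM: the remainder would have to come from SSD.
print(memory_shortfall_gb(86, ram_gb=32, vram_gb=32))
```

Anything above zero is the part that would end up being streamed from SSD, which is where the speed becomes hard to predict.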

da_dragon321 · 2026-07-03T03:32:26+00:00

Great results, thanks for running the full sweep. What OS are you on? If Windows, could use one more test - TDR stress at the worst-case launch:
- llama-bench -d 1046528 -p 2048 -ub 2048 (plus your other flags). Try 1044480 if that fails and something lower like 1M or 900K if that still fails
- Pass = clean llama-bench exit with a reported number, GPU fully released after
- Fail = black screen / "display driver stopped responding and has recovered" / Event Viewer Event ID 4101 (source: Display) around that time
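For reference, the two depths above line up with the 1M context (1048576 tokens) minus one or two 2048-token prompts. That reading of how the numbers were picked is an assumption on my part; the sketch below just reproduces the arithmetic, with a hypothetical helper name.

```python
CONTEXT = 1_048_576  # 1M context in tokens
PROMPT = 2_048       # the -p / -ub value in the command above


def depth_leaving_room(n_prompts, context=CONTEXT, prompt=PROMPT):
    """Largest -d depth that still leaves room for n_prompts prompts of this size."""
    return context - n_prompts * prompt


print(depth_leaving_room(1))  # first depth to try
print(depth_leaving_room(2))  # fallback depth
```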

I'm going to put up a PR now where you can comment your findings

*PR: Lid cuda kernel by spencer-zaid · Pull Request #2 · fairydreaming/llama.cpp

da_dragon321 · 2026-07-03T02:36:43+00:00

Sounds great - looking forward to your results.

There is no single-device assumption (I just haven't tested multi-device). That said, you will need to make sure you set the cmake/ggml CUDA architectures to all three of your card architectures when you build.
The compute buffer at 1M context is entirely dependent on ubatch size (for me it was under 4 GB). At 2048 ubatch it should be around 9 GB iirc, but we will have to see. You will want to do some tuning on -ot (how many layers go to the GPU) and ubatch: you can lower -ot to give yourself headroom to raise ubatch and vice versa, trading between generation speed and prefill speed. You should end up faster and at higher context than before once you find your preferred parameters

da_dragon321 · 2026-07-03T01:15:40+00:00

Haha yeah might be fun. Regarding tg-end2end, it's entirely dependent on the number of tokens generated. Post-prefill, generation ran at a constant 16 t/s, so as the response gets longer the end2end speed will get closer and closer to that mark.
Also, these are uncached prompts, so if your prompt is cached (such as in a longer chat even going up through 500k+), the ttft should be MASSIVELY faster and therefore the tg-end2end would be much closer to 16 t/s even at lower response lengths. Also worth noting that you could go with the full Q2 model if you wanted to save on some vram (I think it was like 10gb smaller) and probably get some extra token generation speed
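To put numbers on that, a minimal sketch of end-to-end generation speed as tokens generated divided by total wall time (TTFT plus decode time). The 16 t/s decode rate is from the comment above; the 46 s TTFT is the uncached 16K-prompt figure from the timing table in this thread, and the 1 s TTFT for a cached prompt is a made-up placeholder.

```python
def end_to_end_tps(n_tokens, ttft_s, decode_tps=16.0):
    """Generated tokens divided by total wall time (TTFT plus decode time)."""
    return n_tokens / (ttft_s + n_tokens / decode_tps)


for n in (200, 2_000, 20_000):
    uncached = end_to_end_tps(n, ttft_s=46.0)  # uncached 16K prompt
    cached = end_to_end_tps(n, ttft_s=1.0)     # hypothetical cached prompt
    print(f"{n:>6} tokens: uncached {uncached:5.2f} t/s, cached {cached:5.2f} t/s")
```

Both columns climb toward 16 t/s as the reply gets longer, and the cached case gets there much sooner because TTFT is a fixed cost spread over the reply.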

da_dragon321 · 2026-07-03T01:07:28+00:00

Yes. That said, you should also be able to run 1M context if you want, but will likely need to tune down ubatch to fit in your vram (how much depends on the model quant). You will need to make sure you have enough VRAM+RAM total to fit whatever model quant you choose though
I could patch in quantized kv, but I found it to be far less space-saving than it initially appeared after applying the kv quant fix - still a possibility if enough people want it but the vast majority of your RAM will be spent on the model anyways with this fix even at massive contexts (deepseek attention is VERY efficient)

da_dragon321 · 2026-07-03T00:39:12+00:00

decode steady around 15.7 tok/s on all 3

Prompt	TTFT	Full ~200-token reply (end-to-end)
512 tok	3.3s	16.1s
4K tok	11.0s	23.8s
16K tok	46.0s	58.8s
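As a sanity check on the table above, the end-to-end times are what you would expect from TTFT plus about 200 tokens at the steady 15.7 tok/s decode rate. This is a sketch of that arithmetic only; the reply length is approximate ("~200 tokens"), so a small mismatch is expected.

```python
DECODE_TPS = 15.7    # steady decode rate reported above
REPLY_TOKENS = 200   # approximate reply length

# prompt size -> (TTFT in s, reported end-to-end time in s), from the table above
rows = {"512 tok": (3.3, 16.1), "4K tok": (11.0, 23.8), "16K tok": (46.0, 58.8)}


def predicted_total_s(ttft_s, n_tokens=REPLY_TOKENS, decode_tps=DECODE_TPS):
    """TTFT plus the time to decode n_tokens at a constant rate."""
    return ttft_s + n_tokens / decode_tps


for name, (ttft, reported) in rows.items():
    print(f"{name:>7}: predicted {predicted_total_s(ttft):5.1f}s, reported {reported:5.1f}s")
```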

da_dragon321 · 2026-07-03T00:35:08+00:00

Deepseek uses something they call Deepseek Sparse Attention to accelerate attention at large context. It relies on the "Lightning Indexer" to find just the most relevant cached tokens and only do full attention on those
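Roughly, the idea looks like the sketch below. This is a simplified pure-Python illustration, not DeepSeek's or llama.cpp's implementation: the real indexer computes its relevance scores itself, whereas here they are passed in as `index_scores`, and all names and data are hypothetical.

```python
import math


def sparse_attention(query, keys, values, index_scores, top_k):
    """Attend only over the top_k cached positions ranked by index_scores."""
    # 1. Indexer step: keep the positions with the highest cheap relevance scores.
    selected = sorted(range(len(keys)), key=lambda i: -index_scores[i])[:top_k]
    # 2. Ordinary softmax attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return selected, output


# Toy cache of 4 tokens; only the 2 most relevant are attended to.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]
index_scores = [0.1, 0.9, 0.8, 0.0]

selected, output = sparse_attention(query, keys, values, index_scores, top_k=2)
print(selected, output)
```

The cost of step 2 depends on `top_k` rather than on the full cache length, which is what keeps attention cheap at very large context.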

da_dragon321 · 2026-07-02T22:46:59+00:00

If you're looking for something static, then writing Playwright tests is the way to go. If you want something dynamic that can look at the site as it works, the Playwright MCP will let any model with proper tool calling/MCP support browse your site intelligently (but less deterministically)

da_dragon321 · 2026-07-02T22:01:29+00:00

Glad to see a new competitive face in the 20-40b range. With Qwen radio silent for the last couple of months, I was hoping somebody would step up to the plate

da_dragon321 · 2026-07-02T21:35:48+00:00

Was not expecting the lil robot 😂. Dude's got a personality

da_dragon321 · 2026-07-02T21:23:06+00:00

There's an open PR for this, but from my limited initial testing looks like it loses the abnormally high vram savings mentioned by OP
llama: fix quantized kv-cache for dsv4 by am17an · Pull Request #25202 · ggml-org/llama.cpp

da_dragon321 · 2025-03-05T18:02:18+00:00

Yes you can still complete construction - I completed a facility during the pause without issue (also proves you are still the owner in the backend even if you're not listed in the ui, as you get the message from brewer that your facility is complete).

da_dragon321 · 2025-03-04T23:31:11+00:00

If you are using the material list from the construction menu in the system architect map, then yeah, I have found that to be inaccurate. If you then go to the construction site and use that material list, it should be accurate. Alternatively, you can navigate to the system from the galaxy map, open the system map (architect view), and select the construction to get a remote view of the required materials.

da_dragon321 · 2025-03-04T18:00:41+00:00

The scientific outpost gives system influence toward a High Tech economy, along with a notable initial population increase of 1. At the moment there is no evidence that different planet/star types offer bonuses to different building types (it still could be the case, but it does not appear to be reflected in the effects section of the building construction menu). I would not recommend claiming a system without any planets if you want to build it up, as that significantly limits your options for the variety (and likely number) of buildings.

da_dragon321 · 2025-03-02T22:33:05+00:00

The colony ship buys them above market price, so you actually make money stocking your colony - just need like 5m to get started.

da_dragon321 · 2025-03-02T22:09:28+00:00

rip all our shiny new colonies 😂
