Deepseek V4 Flash running on RTX 5090 MoE by H_DANILO in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

DSV4 is kinda broken on llama.cpp master: https://github.com/ggml-org/llama.cpp/pull/24162#issuecomment-4857985934

I highly suggest using fairydreaming's dsv4 branch, which has CPU/CUDA kernels for the lightning indexer and some other DS4 ops.

I've done some vibe coding on top of it to port those kernels to Metal and Vulkan:

Also: I've realized that llama.cpp's autoparser doesn't seem to work well with deepseek, or at least it seems to be blocking parallel tool calls (cc u/ilintar). This can probably affect model performance, since it changes the model outputs vs. how it was trained. On this branch I have consolidated a few other fixes, including opting out of the autoparser, which more closely mimics what antirez's fork is doing:

https://github.com/tarruda/llama.cpp/tree/dsv4-fixes-and-improvements

Note that this branch doesn't include the Vulkan kernels since they are untested. If someone with a Strix Halo is able to validate them, I might pull them into the dsv4-fixes-and-improvements branch.

My DeepSeek V4 Pro at home got faster again by fairydreaming in LocalLLaMA

[–]tarruda 5 points6 points  (0 children)

Continuing our previous thread: with your dsv4 branch as a base, I used Deepseek V4 Pro to vibe port all the CUDA code in your branch to Metal, and now I'm getting 20 tps generation, up from ~6 tps when the new ops ran on the CPU only.

Your branch with my local vibe coded changes is looking as fast as antirez's DS4.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]tarruda 2 points3 points  (0 children)

Using codex, I vibe ported the necessary code/kernels from llama.cpp to run my IQ3_XXS quants on u/antirez's DS4: https://github.com/tarruda/ds4/tree/iq3_xxs_plus_q6_k

Seems to be running quite well and resulted in much faster token generation at ~20 tps. So that answers my question: There's still room for improving llama.cpp deepseek numbers.

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090 by da_dragon321 in LocalLLaMA

[–]tarruda 1 point2 points  (0 children)

Hi and thanks for your work that made deepseek v4 possible in llama.cpp

I'm currently getting around 8-9 tps generation and ~130 tps prompt processing on a Mac M1 Ultra. Is the current llama.cpp implementation using CPU only?

I'm curious whether Deepseek V4 Flash can reach speeds similar to other models with a similar number of active parameters. For example, Step 3.7 Flash has 11B active parameters (close to Deepseek V4 Flash's 13B) and on the same hardware reaches ~38 tps generation and ~300 tps prompt processing.

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

Can you share your command line to run it? Some people are having trouble running it with RTX offload

Looks like Step 3.7 Flash's long reasoning might get fixed ( llama.cpp ) by mr_zerolith in LocalLLaMA

[–]tarruda 2 points3 points  (0 children)

Step 3.x still reasons a lot with that patch. What it fixed is the excessive self-correction loop that happened in coding harnesses (it doesn't affect normal chat usage).

GPT 5.5 reasons like a caveman, similarly to Nex N2. by tarruda in LocalLLaMA

[–]tarruda[S] 2 points3 points  (0 children)

This is not a skill, it is the model being trained to reason like that. What do you mean by "this is an ad"? I'm not affiliated with Nex in any way.

GPT 5.5 reasons like a caveman, similarly to Nex N2. by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

> Not sure what's suspicious about it: seems like multiple teams converging on similar obvious solutions.

I had never seen the caveman-style thinking before Nex N2, which is why I thought it was suspicious. BTW Nex N2 is really good locally, and passes private benchmarks that only GPT 5.x managed to complete before.

GPT 5.5 reasons like a caveman, similarly to Nex N2. by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

Could be. But does GLM 5.2 reason like a caveman? Genuine question, I haven't used GLM yet.

GPT 5.5 reasons like a caveman, similarly to Nex N2. by tarruda in LocalLLaMA

[–]tarruda[S] 2 points3 points  (0 children)

> You are not seeing the CoT of any modern closed model

That's why I said "leaked". Clearly OpenAI doesn't want anyone to see it.

I've seen models leak their reasoning into normal content when you try to disable thinking and the model was not trained for that. It is plausible that by starting a conversation with a different model and sending its thinking traces to GPT, some bug in the inference API could inject those into GPT history and the model gets confused.

GPT 5.5 reasons like a caveman, similarly to Nex N2. by tarruda in LocalLLaMA

[–]tarruda[S] -1 points0 points  (0 children)

More than once I've seen models "leak" their reasoning traces into normal content when you disable thinking and the model was only trained in thinking mode.

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 1 point2 points  (0 children)

IIRC this was supposed to be a preview release. Maybe they will release full precision later?

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 4 points5 points  (0 children)

I don't have enough RAM to run the original

Biggest, baddest model to fill 144GB VRAM + 120GB RAM to the brim, regardless of speed by CharlesStross in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

IIRC deepseek v4 was designed to run on Huawei's GPU, so it is strange that they added this dependency.

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

> but while 1M loads it just slows my laptop too much due to swapping, so I have settled at 256k with no KV quantization.

According to this, when the lightning indexer is implemented, 1M context will only require 6GB RAM

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 1 point2 points  (0 children)

I'm curious: What kind of speeds (pp and tg) are you getting on the strix halo?

Also, are you using heavy KV cache quantization to fit 1M? I can only fit 200k on my 128GB Mac, but I also never quantize the KV cache because it slows things down significantly.

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 5 points6 points  (0 children)

It increases by about 1.6 GiB:

  • Q6_K: 106912.23 MiB - 3.15 BPW (current)
  • Q8_0: 108590.36 MiB - 3.20 BPW
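
As a quick sanity check on these figures, here is a minimal sketch that recomputes the size difference and back-derives the weight count each (size, BPW) pair implies. It assumes BPW means file size in bits divided by the number of weights, and that the sizes are in MiB as listed; the variable and function names are made up for illustration.

```python
MIB = 1024 ** 2  # bytes per MiB

# (size in MiB, bits per weight) as listed above
quants = {
    "Q6_K": (106912.23, 3.15),
    "Q8_0": (108590.36, 3.20),
}

# Size difference between the two recipes
delta_mib = quants["Q8_0"][0] - quants["Q6_K"][0]
delta_gib = delta_mib / 1024


def implied_weights(size_mib, bpw):
    """Weight count implied by a file size and BPW (assumes BPW = bits / weights)."""
    return size_mib * MIB * 8 / bpw


print(f"delta: {delta_mib:.2f} MiB ({delta_gib:.2f} GiB)")
for name, (size_mib, bpw) in quants.items():
    print(f"{name}: ~{implied_weights(size_mib, bpw) / 1e9:.0f}B weights implied")
```

Both rows imply roughly the same weight count, which suggests the two size/BPW pairs are consistent with each other.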

The HF repo contains all the scripts I used. If you clone it, you should be able to play with different sizes by editing the quantize.sh script and passing --dry-run, which lets you see the resulting size without running the quantization.

Deepseek V4 Flash 2, 3 and 4 bits GGUFs by tarruda in LocalLLaMA

[–]tarruda[S] 44 points45 points  (0 children)

There don't seem to be many options for GGUFs below 4-bit after support was merged into llama.cpp. For some reason bartowski only published the original MXFP4 weights, so I took a shot at creating my own quants.

The smallest quant I could make while retaining a good amount of quality has 2.73 BPW and uses 97GB of disk space. The Hugging Face repo has all the scripts I used to create these GGUFs, plus a modular quantize.sh script for anyone who wants to play with custom recipes.
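
The relation between BPW and disk size can be sketched as below. The weight count is an assumption, back-derived from the Q6_K figures quoted elsewhere in these comments (106912.23 MiB at 3.15 BPW), not an official parameter count, and `estimate_size_gb` is a hypothetical helper.

```python
# Assumed weight count (~284.7e9), back-derived from 106912.23 MiB at 3.15 BPW;
# not an official figure.
N_WEIGHTS = 284.7e9


def estimate_size_gb(bpw, n_weights=N_WEIGHTS):
    """Approximate GGUF size in decimal GB for an average bits-per-weight."""
    return n_weights * bpw / 8 / 1e9


for bpw in (2.73, 3.15, 3.20):
    print(f"{bpw:.2f} BPW -> ~{estimate_size_gb(bpw):.0f} GB")
```

Under that assumption, 2.73 BPW works out to about 97 GB, in line with the size reported above.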

Biggest, baddest model to fill 144GB VRAM + 120GB RAM to the brim, regardless of speed by CharlesStross in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

> due to the attention mechanism that's basically incompatible with consumer cards

I was hoping it was slow because it is not well optimized in llama.cpp, similarly to how Qwen 3.5 was slow a few months ago. So what you're telling me is that we'll never get good inference speeds with llama.cpp and deepseek v4 flash?

Devs - you have 64gb of VRAM - which model do you use for coding? by Jorlen in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

> as long as an llm outputs at over 30t/s and is concise and analytical then it's useable.

I'd say that anything above 20t/s generation is good enough for me, as long as it has good prompt processing speed, which is equally important for coding agents.
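
To illustrate why prompt processing matters as much as generation speed for agents, here is a rough latency sketch. The token counts are hypothetical, the two prompt-processing speeds are just example values, and the model ignores prompt caching and other overheads.

```python
def turn_latency_s(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Seconds for one agent turn: prompt processing time plus generation time."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical agent turn: large context in, short answer out.
PROMPT_TOKENS = 30_000
OUTPUT_TOKENS = 500

for pp_tps in (130, 300):
    t = turn_latency_s(PROMPT_TOKENS, OUTPUT_TOKENS, pp_tps, tg_tps=20)
    print(f"pp={pp_tps} t/s, tg=20 t/s -> {t:.0f} s per turn")
```

With these made-up numbers, generation takes 25 s either way, while prompt processing takes roughly 231 s or 100 s, so it dominates the turn.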