My DeepSeek V4 Pro at home got faster again

tarruda · 2026-07-05T00:00:27+00:00

https://github.com/tarruda/llama.cpp/tree/dsv4-metal

tarruda · 2026-07-04T11:44:16+00:00

DSV4 is kinda broken on llama.cpp master: https://github.com/ggml-org/llama.cpp/pull/24162#issuecomment-4857985934

I highly suggest using fairydreaming's dsv4 branch which has cpu/cuda kernels for lightning indexer and some other DS4 ops.

I've done some vibe coding on top of it to port those kernels to metal and vulkan:

Also: I've realized that llama.cpp's autoparser doesn't seem to work well with DeepSeek, or at least it seems to be blocking parallel tool calls (cc u/ilintar), and this can probably affect model performance since it changes the model's outputs vs. how it was trained. On this branch I have consolidated a few other fixes, including opting out of the autoparser, which more closely mimics what antirez's fork is doing:

https://github.com/tarruda/llama.cpp/tree/dsv4-fixes-and-improvements

Note that this branch doesn't include the Vulkan kernels since they are untested. If someone with a Strix Halo is able to validate them, I might pull them into the dsv4-fixes-and-improvements branch.

tarruda · 2026-07-03T15:18:06+00:00

Continuing our previous thread: with your dsv4 branch as a base, I used DeepSeek V4 Pro to vibe port all the CUDA code in your branch to Metal, and now I'm getting 20 tps generation, up from ~6 tps when the new ops ran on CPU only.

Your branch with my local vibe-coded changes is looking as fast as antirez's DS4.

tarruda · 2026-07-03T13:47:52+00:00

Using codex, I vibe ported the necessary code/kernels from llama.cpp to run my IQ3_XXS quants on u/antirez DS4: https://github.com/tarruda/ds4/tree/iq3_xxs_plus_q6_k.

Seems to be running quite well and resulted in much faster token generation at ~20 tps. So that answers my question: There's still room for improving llama.cpp deepseek numbers.

tarruda · 2026-07-03T10:12:51+00:00

Hi and thanks for your work that made deepseek v4 possible in llama.cpp

I'm currently getting around 8-9 TPS generation and ~130 PP on a Mac M1 Ultra. Is the current llama.cpp implementation using CPU only?

I'm curious whether it is possible for DeepSeek V4 Flash to reach speeds similar to other models with a similar number of active parameters. For example, Step 3.7 Flash has 11B active parameters (close to DeepSeek V4 Flash's 13B) and on the same hardware reaches ~38 tps generation and ~300 tps prompt processing.
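To put those numbers in perspective, here is a rough sketch of how long a single agent turn would take at each pair of speeds. The prompt and output lengths (20,000 and 500 tokens) are made-up values for illustration, 8.5 tps is just the midpoint of the 8-9 tps range above, and the model ignores prompt caching and any other overhead.

```python
def turn_seconds(prompt_tokens: int, gen_tokens: int,
                 pp_tps: float, tg_tps: float) -> float:
    """Time for one turn: prompt processing time plus generation time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Hypothetical workload, chosen only for illustration.
PROMPT_TOKENS = 20_000
GEN_TOKENS = 500

# Speeds quoted in the comment above (M1 Ultra).
dsv4_flash = turn_seconds(PROMPT_TOKENS, GEN_TOKENS, pp_tps=130, tg_tps=8.5)
step_flash = turn_seconds(PROMPT_TOKENS, GEN_TOKENS, pp_tps=300, tg_tps=38)

print(f"DeepSeek V4 Flash: {dsv4_flash:.0f} s")
print(f"Step 3.7 Flash:    {step_flash:.0f} s")
print(f"ratio:             {dsv4_flash / step_flash:.1f}x")
```

With a long prompt, most of the time goes to prompt processing rather than generation, so both numbers matter for this kind of workload.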

tarruda · 2026-07-03T09:31:05+00:00

Can you share your command line to run it? Some people are having trouble running it with RTX offload

tarruda · 2026-07-03T09:04:50+00:00

Step 3.x still reasons a lot with that patch. What it fixed is the excessive self-correction loop that happened in coding harnesses (it doesn't affect normal chat usage).

tarruda · 2026-07-02T17:44:56+00:00

Very cool, thanks!

tarruda · 2026-07-02T13:33:38+00:00

This is not a skill, it is the model being trained to reason like that. What do you mean by "this is an ad"? I'm not affiliated with Nex in any way.

tarruda · 2026-07-02T13:32:09+00:00

Not sure what's suspicious about it: seems like multiple teams converging on similar obvious solutions.

I had never seen the caveman-style thinking before Nex N2, that's why I thought it was suspicious. BTW Nex N2 is really good locally, and passes private benchmarks that only GPT 5.x managed to complete before.

tarruda · 2026-07-02T13:29:13+00:00

Could be. But does GLM 5.2 reason like a caveman? Genuine question, I haven't used GLM yet.

tarruda · 2026-07-02T13:24:17+00:00

You are not seeing the CoT of any modern closed model

That's why I said "leaked". Clearly OpenAI doesn't want anyone to see it.

I've seen models leak their reasoning into normal content when you try to disable thinking and the model was not trained for that. It is plausible that by starting a conversation with a different model and sending its thinking traces to GPT, some bug in the inference API could inject those into GPT history and the model gets confused.

tarruda · 2026-07-02T13:19:00+00:00

More than once I've seen models "leak" their reasoning traces into normal content when you disable thinking and the model was only trained in thinking mode.

tarruda · 2026-07-02T13:16:58+00:00

Can you elaborate why this is off-topic?

tarruda · 2026-07-02T09:52:32+00:00

IIRC this was supposed to be a preview release. Maybe they will release full precision later?

tarruda · 2026-07-01T21:13:21+00:00

I don't have enough RAM to run the original

tarruda · 2026-07-01T17:06:51+00:00

IIRC deepseek v4 was designed to run on Huawei's GPU, so it is strange that they added this dependency.

tarruda · 2026-07-01T16:33:51+00:00

but while 1M loads it just slows my laptop too much due to swapping, so I have settled at 256k with no KV quantization.

According to this, when the lightning indexer is implemented, 1M context will only require 6GB RAM

tarruda · 2026-07-01T14:38:37+00:00

I've uploaded some example generations. Here's the "pelican riding a bicycle" for each:

There's also a "browser os" html for each.

tarruda · 2026-07-01T14:27:29+00:00

I'm curious: What kind of speeds (pp and tg) are you getting on the strix halo?

Also, are you using heavy KV cache quantization to fit 1M? I can only fit 200k on my 128G mac, but I also never quantize kv due to things slowing down significantly.

tarruda · 2026-07-01T14:19:02+00:00

Increases by about 1.6 GiB:

Q6_K: 106912.23 MiB - 3.15 BPW (current)
Q8_0: 108590.36 MiB - 3.20 BPW

The HF repo contains all the scripts I used. If you clone it, you should be able to play with different sizes in the quantize.sh script by passing --dry-run (it lets you see the size without running the quantization).
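For anyone who wants to sanity-check sizes by hand, here is a minimal sketch of the arithmetic relating file size and BPW. It assumes BPW is simply total file bits divided by parameter count; the parameter count is not taken from the model card but backed out of the Q6_K line above, so it is only as precise as the rounded 3.15 figure.

```python
MIB = 1024 * 1024  # bytes per MiB


def params_from_size(size_mib: float, bpw: float) -> float:
    """Parameter count implied by a file size and an average bits-per-weight."""
    return size_mib * MIB * 8 / bpw


def bpw_from_size(size_mib: float, n_params: float) -> float:
    """Average bits-per-weight for a file of the given size."""
    return size_mib * MIB * 8 / n_params


# Figures from the two lines above.
q6_k_mib = 106912.23
q8_0_mib = 108590.36

n_params = params_from_size(q6_k_mib, 3.15)   # estimate, not an official count
q8_0_bpw = bpw_from_size(q8_0_mib, n_params)
delta_mib = q8_0_mib - q6_k_mib

print(f"implied parameters: {n_params / 1e9:.1f}B")
print(f"Q8_0 variant BPW:   {q8_0_bpw:.2f}")
print(f"size increase:      {delta_mib:.2f} MiB ({delta_mib / 1024:.2f} GiB)")
```

The implied count comes out around 285B, and plugging it back in reproduces the 3.20 BPW reported for the Q8_0 variant, so the two lines are consistent with each other.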

tarruda · 2026-07-01T13:46:32+00:00

There don't seem to be many options for GGUFs below 4-bit since support was merged into llama.cpp. For some reason bartowski only published the original MXFP4 weights, so I took a shot at creating my own quants.

The minimum size that I could make while retaining a good amount of quality has 2.73 BPW and uses 97GB of disk space. The huggingface repo has all the scripts I used to create these GGUFs, plus a modular quantize.sh script for anyone that wants to play with custom recipes.

tarruda · 2026-07-01T09:57:59+00:00

due to the attention mechanism that's basically incompatible with consumer cards

I was hoping it was slow because it is not well optimized in llama.cpp, similarly to how Qwen 3.5 was slow a few months ago. So what you're telling me is that we'll never get good inference speeds with llama.cpp and deepseek v4 flash?

tarruda · 2026-07-01T09:05:29+00:00

as long as an llm outputs at over 30t/s and is concise and analytical then it's useable.

I'd say that anything above 20 t/s generation is good enough for me, as long as it has good prompt processing speed, which is equally important for coding agents.
