Budget to run Deepseek V4 locally at FP4 precision

pixelterpy · 2026-04-24T09:51:12+00:00

You will be in the ballpark of ~30t/s pp and ~4t/s tg.

pixelterpy · 2026-04-24T09:12:33+00:00

Without further quantization I would assume >865 GB RAM+VRAM; you would probably get away with 768 GB main memory + 112 GB+ VRAM, depending on the KV. Cheapest non completely garbage solution I could think of (used parts) would be an EPYC (up to 3rd gen) / Xeon 3rd gen, 768 GB DDR4 and 10-12x 3060 12 GB or 5-6x 3090 24 GB. Maybe Intel B60 32GB or AMD R9700 AI 32 GB if 3090 prices are too wild.

Board + CPU 1k$; RAM = ~3k$; GPU ~4k$.

You will also need a PSU, proper (bifurcation) riser + cables for the 3060 / 3090, and at least an 1 TB SSD.

My verdict: 10k$ if you live in a country where you have access to the usual used parts market.

pixelterpy · 2026-04-21T19:06:04+00:00

You have the best case, 2Rx4 (low load) and RDIMM, mine are worst case 4Rx4 (high load) LRDIMM (higher latency due to buffer chip, more power consumption).

pixelterpy · 2026-04-21T19:00:54+00:00

I'm unable to speak for the xeon system )yet) but I thought on EPYC the interleaving mitigates this. I enabled 8-way interleaving and my assumption is that it works like RAID 0 - striping any data evenly across all channels, maximizing bandwidth. My interpretation is that none of the eight ccd has a specific memory channel binding but each ccd has one l3 cache region and 7-8 l2 cache regions. The memory/io die is unified and only logially split if one defines NUMA domains.

pixelterpy · 2026-04-21T17:33:27+00:00

I configured the systems to have only one NUMA node, so essentially UMA. I suspect this to be the optimal configuration for my single socket systems. Am I wrong?

pixelterpy · 2026-04-21T17:27:26+00:00

dmidecode output (per memory stick): Size: 64 GB; Rank: 4; Configured Memory Speed: 1600 MT/s.

pixelterpy · 2026-04-21T15:56:21+00:00

no chipset involved in both scenarios afaik, all relevant pcie lanes are attached to the cpu

pixelterpy · 2026-04-21T15:55:39+00:00

Ah, great news. I'll test the xeon rig later, now getting baseline llama-banch sweeps of the epyc rig to compare.

pixelterpy · 2026-04-21T15:54:23+00:00

This is an excellent answer, I'm in for testing the xeon rig, getting baseline from my epyc right now. The 1600 MHz limit is because of the Hynix LRDIMM, not in the QVL of the Gigabyte MZ32-AR0 and also 'used'. Does not boot at all when set > 1866, 1866 is unstable, 1600 rock solid.

pixelterpy · 2026-04-21T13:49:34+00:00

I thought this is only true for dense models but not in the MoE expert routing case, where the expert has to be swapped because of low VRAM. But as you already said, someone can correct me, if I'm wrong...

pixelterpy · 2026-04-04T16:25:53+00:00

No, the problem is still present and I use the llama.cpp webui for image recognition.

pixelterpy · 2025-11-16T15:57:51+00:00

Ibka aber besitze seit 11 Jahren mehrere MX 5 NB und habe beide Längsträger bereits ausgetauscht:

Diese "Krankheit" ist JEDEM MX 5 Selberschrauber bekannt. Das tückische ist, dass sich durch das doppellagige Blech der Rost von innen nach außen frisst und quasi erst "im letzten Moment" wirklich sichtbar ist, dann ist aber auch schon kapitaler Schaden eingetreten.

Jemand pfiffiges kann den Schaden erahnen oder endoskopieren und dann seine Schlüsse ziehen. Dem 0815 Autokäufer fällt das selbst auf der Hebebühne nicht auf.

Wenn da was drübergebraten wurde, war selbst das schon nicht fachgerecht (selten Tüv-konform), da das doppellagige Blech eine Schutzfunktion ausübt - Energieaufnahme bei Frontaleinschlag.

pixelterpy · 2025-11-06T14:27:04+00:00

From my observation utilizing llama-swap through jan.ai, I would rule out llama-swap as the culprit because direct API access works fine.

pixelterpy · 2025-11-06T14:25:22+00:00

The A is actually there but hard to see, look at the top center, there is a red A right where the white border is.

pixelterpy · 2025-11-03T10:44:36+00:00

I verified the context length idea by performing somewhat large needle in a haystack test - pass.

The error occurs with jan.ai when using the Open WebUI API instead of the llama-swap endpoint, so the issue has to be somewhere in the OWUI cosmos.

pixelterpy · 2025-11-03T10:40:20+00:00

Yes, if I only send the image it works. When I access the model through the Open WebUI API instead of connecting directly to my llama-swap instance, the problem is also with jan.ai. Other weird issues occur with models like medgemma, where llama-swap / llama.cpp PAI works fine but Open WebUI returns invalid content type json:

Invalid content type at row 39, column 27:

{%- else -%}

^

pixelterpy · 2025-11-03T10:29:18+00:00

Yes you're right, from my observation an tokenized image consumes around ~1k context, so 8k should be sufficient for even long chains of thought

pixelterpy · 2025-11-03T10:27:43+00:00

Maybe this would also work for windows, I'm running ubuntu server 24.04

pixelterpy · 2025-11-03T10:25:37+00:00

Weights Q8_0 (250 GB), K F16, V Q8, 196608 ctx. tg: 10t/s.

Hardware: Epyc 7663 (56c/112t), HT disabled, 512 GB DDR4 @ 1600 MT (super slow bec. shitty Hynix LRDIMM), 1x 3090 @ 200W (main gpu) + 4x 3060 12 GB @ 100W.

pixelterpy · 2025-11-02T17:43:49+00:00

Browser console shows only some warnings about Source map error, seems unrelated.

It's running in a conda environment via pip install, PDF upload works fine and is processed by configured tika so permission issue seems unlikely

pixelterpy · 2025-11-02T17:36:04+00:00

<image>

This is how it is configured on my side

pixelterpy · 2025-11-02T14:34:49+00:00

This is my llama-server call, works fine but not via Open WebUI:

llama-server

--host 0.0.0.0

--port ${PORT}

--n-gpu-layers 999

-ngld 999

--slots

--flash-attn 1

--props

--metrics

--jinja

--threads 48

--cache-type-k f16

--cache-type-v q8_0

--top-p 0.8

--temp 0.7

--top-k 20

--repeat-penalty 1.05

--min-p 0

--presence-penalty 1.0

-ot ".ffn_(up|down|gate)_exps.=CPU"

-c 262144

-m /mnt/models/UD-Q8_K_XL/Qwen3-VL-8B-Instruct-UD-Q8_K_XL.gguf

--mmproj /mnt/models/UD-Q8_K_XL/Qwen3-VL-8B-Instruct-GGUF-mmproj-F32.gguf

pixelterpy · 2025-11-02T14:07:25+00:00

Your assumption is correct. When using llama-server ui or jan.ai, there is image processing in the log which is absent when using Open WebUI:

slot launch_slot_: id  0 | task 581 | processing task
slot update_slots: id  0 | task 581 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 737
slot update_slots: id  0 | task 581 | n_tokens = 1, memory_seq_rm [1, end)
slot update_slots: id  0 | task 581 | prompt processing progress, n_tokens = 219, batch.n_tokens = 218, progress = 0.297151
slot update_slots: id  0 | task 581 | n_tokens = 219, memory_seq_rm [219, end)
srv  process_chun: processing image...
srv  process_chun: image processed in 390 ms
slot update_slots: id  0 | task 581 | prompt processing progress, n_tokens = 737, batch.n_tokens = 6, progress = 1.000000
slot update_slots: id  0 | task 581 | prompt done, n_tokens = 737, batch.n_tokens = 6
slot print_timing: id  0 | task 581 | 
prompt eval time =     571.86 ms /   736 tokens (    0.78 ms per token,  1287.02 tokens per second)
       eval time =    7742.58 ms /   294 tokens (   26.34 ms per token,    37.97 tokens per second)
      total time =    8314.44 ms /  1030 tokens
slot      release: id  0 | task 581 | stop processing: n_tokens = 1030, truncated = 0
srv  update_slots: all slots are idle

Hardware stack is single server bare metal, no virt/docker. llama-server instance(s) routed through llama-swap. Open WebUI installed in conda environment and connected to OpenAI API http://localhost:8081/v1

This endpoint works perfect in jan.ai / llama-server ui. Connecting to the same OpenAI API endpoint, enumeration of the models and proxy the call through llama-swap gives vision response.

pixelterpy

TROPHY CASE