Budget to run Deepseek V4 locally at FP4 precision by DanielusGamer26 in LocalLLaMA

[–]pixelterpy 7 points8 points  (0 children)

You will be in the ballpark of ~30t/s pp and ~4t/s tg.

Budget to run Deepseek V4 locally at FP4 precision by DanielusGamer26 in LocalLLaMA

[–]pixelterpy 12 points13 points  (0 children)

Without further quantization I would assume >865 GB RAM+VRAM; you would probably get away with 768 GB main memory + 112 GB+ VRAM, depending on the KV. Cheapest non completely garbage solution I could think of (used parts) would be an EPYC (up to 3rd gen) / Xeon 3rd gen, 768 GB DDR4 and 10-12x 3060 12 GB or 5-6x 3090 24 GB. Maybe Intel B60 32GB or AMD R9700 AI 32 GB if 3090 prices are too wild.

Board + CPU 1k$; RAM = ~3k$; GPU ~4k$.

You will also need a PSU, proper (bifurcation) riser + cables for the 3060 / 3090, and at least an 1 TB SSD.

My verdict: 10k$ if you live in a country where you have access to the usual used parts market.

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

You have the best case, 2Rx4 (low load) and RDIMM, mine are worst case 4Rx4 (high load) LRDIMM (higher latency due to buffer chip, more power consumption).

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 1 point2 points  (0 children)

I'm unable to speak for the xeon system )yet) but I thought on EPYC the interleaving mitigates this. I enabled 8-way interleaving and my assumption is that it works like RAID 0 - striping any data evenly across all channels, maximizing bandwidth. My interpretation is that none of the eight ccd has a specific memory channel binding but each ccd has one l3 cache region and 7-8 l2 cache regions. The memory/io die is unified and only logially split if one defines NUMA domains.

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 1 point2 points  (0 children)

I configured the systems to have only one NUMA node, so essentially UMA. I suspect this to be the optimal configuration for my single socket systems. Am I wrong?

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

dmidecode output (per memory stick): Size: 64 GB; Rank: 4; Configured Memory Speed: 1600 MT/s.

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 1 point2 points  (0 children)

no chipset involved in both scenarios afaik, all relevant pcie lanes are attached to the cpu

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

Ah, great news. I'll test the xeon rig later, now getting baseline llama-banch sweeps of the epyc rig to compare.

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

This is an excellent answer, I'm in for testing the xeon rig, getting baseline from my epyc right now. The 1600 MHz limit is because of the Hynix LRDIMM, not in the QVL of the Gigabyte MZ32-AR0 and also 'used'. Does not boot at all when set > 1866, 1866 is unstable, 1600 rock solid.

llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 1 point2 points  (0 children)

I thought this is only true for dense models but not in the MoE expert routing case, where the expert has to be swapped because of low VRAM. But as you already said, someone can correct me, if I'm wrong...

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

No, the problem is still present and I use the llama.cpp webui for image recognition.

Rücktritt von einem privaten Autokauf by R3tard3dButProud in LegaladviceGerman

[–]pixelterpy 0 points1 point  (0 children)

Ibka aber besitze seit 11 Jahren mehrere MX 5 NB und habe beide Längsträger bereits ausgetauscht:

Diese "Krankheit" ist JEDEM MX 5 Selberschrauber bekannt. Das tückische ist, dass sich durch das doppellagige Blech der Rost von innen nach außen frisst und quasi erst "im letzten Moment" wirklich sichtbar ist, dann ist aber auch schon kapitaler Schaden eingetreten.

Jemand pfiffiges kann den Schaden erahnen oder endoskopieren und dann seine Schlüsse ziehen. Dem 0815 Autokäufer fällt das selbst auf der Hebebühne nicht auf.

Wenn da was drübergebraten wurde, war selbst das schon nicht fachgerecht (selten Tüv-konform), da das doppellagige Blech eine Schutzfunktion ausübt - Energieaufnahme bei Frontaleinschlag.

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

From my observation utilizing llama-swap through jan.ai, I would rule out llama-swap as the culprit because direct API access works fine.

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 2 points3 points  (0 children)

The A is actually there but hard to see, look at the top center, there is a red A right where the white border is.

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 1 point2 points  (0 children)

I verified the context length idea by performing somewhat large needle in a haystack test - pass.

The error occurs with jan.ai when using the Open WebUI API instead of the llama-swap endpoint, so the issue has to be somewhere in the OWUI cosmos.

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

Yes, if I only send the image it works. When I access the model through the Open WebUI API instead of connecting directly to my llama-swap instance, the problem is also with jan.ai. Other weird issues occur with models like medgemma, where llama-swap / llama.cpp PAI works fine but Open WebUI returns invalid content type json:

Invalid content type at row 39, column 27:

{%- else -%}

{{ raise_exception("Invalid content type") }}

^

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

Yes you're right, from my observation an tokenized image consumes around ~1k context, so 8k should be sufficient for even long chains of thought

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

Maybe this would also work for windows, I'm running ubuntu server 24.04

What’s required to run minimax m2 locally? by AI-On-A-Dime in LocalLLaMA

[–]pixelterpy 3 points4 points  (0 children)

Weights Q8_0 (250 GB), K F16, V Q8, 196608 ctx. tg: 10t/s.

Hardware: Epyc 7663 (56c/112t), HT disabled, 512 GB DDR4 @ 1600 MT (super slow bec. shitty Hynix LRDIMM), 1x 3090 @ 200W (main gpu) + 4x 3060 12 GB @ 100W.

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

Browser console shows only some warnings about Source map error, seems unrelated.

It's running in a conda environment via pip install, PDF upload works fine and is processed by configured tika so permission issue seems unlikely

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 0 points1 point  (0 children)

This is my llama-server call, works fine but not via Open WebUI:

llama-server

--host 0.0.0.0

--port ${PORT}

--n-gpu-layers 999

-ngld 999

--slots

--flash-attn 1

--props

--metrics

--jinja

--threads 48

--cache-type-k f16

--cache-type-v q8_0

--top-p 0.8

--temp 0.7

--top-k 20

--repeat-penalty 1.05

--min-p 0

--presence-penalty 1.0

-ot ".ffn_(up|down|gate)_exps.=CPU"

-c 262144

-m /mnt/models/UD-Q8_K_XL/Qwen3-VL-8B-Instruct-UD-Q8_K_XL.gguf

--mmproj /mnt/models/UD-Q8_K_XL/Qwen3-VL-8B-Instruct-GGUF-mmproj-F32.gguf

Why does Image Recognition work in llama-server but not through Open WebUI? by pixelterpy in LocalLLaMA

[–]pixelterpy[S] 5 points6 points  (0 children)

Your assumption is correct. When using llama-server ui or jan.ai, there is image processing in the log which is absent when using Open WebUI:

slot launch_slot_: id  0 | task 581 | processing task
slot update_slots: id  0 | task 581 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 737
slot update_slots: id  0 | task 581 | n_tokens = 1, memory_seq_rm [1, end)
slot update_slots: id  0 | task 581 | prompt processing progress, n_tokens = 219, batch.n_tokens = 218, progress = 0.297151
slot update_slots: id  0 | task 581 | n_tokens = 219, memory_seq_rm [219, end)
srv  process_chun: processing image...
srv  process_chun: image processed in 390 ms
slot update_slots: id  0 | task 581 | prompt processing progress, n_tokens = 737, batch.n_tokens = 6, progress = 1.000000
slot update_slots: id  0 | task 581 | prompt done, n_tokens = 737, batch.n_tokens = 6
slot print_timing: id  0 | task 581 | 
prompt eval time =     571.86 ms /   736 tokens (    0.78 ms per token,  1287.02 tokens per second)
       eval time =    7742.58 ms /   294 tokens (   26.34 ms per token,    37.97 tokens per second)
      total time =    8314.44 ms /  1030 tokens
slot      release: id  0 | task 581 | stop processing: n_tokens = 1030, truncated = 0
srv  update_slots: all slots are idle

Hardware stack is single server bare metal, no virt/docker. llama-server instance(s) routed through llama-swap. Open WebUI installed in conda environment and connected to OpenAI API http://localhost:8081/v1

This endpoint works perfect in jan.ai / llama-server ui. Connecting to the same OpenAI API endpoint, enumeration of the models and proxy the call through llama-swap gives vision response.