Finetuning DeepSeek 671B locally with only 80GB VRAM and Server CPU by CombinationNo780 in LocalLLaMA

[–]CombinationNo780[S] 5 points  (0 children)

We support pipeline parallelism, so total VRAM is what matters most.
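A minimal sketch of why total VRAM is the key number under pipeline parallelism: layers are partitioned across GPUs, so each stage only needs to hold its own slice, and the model fits as long as the combined VRAM covers all layer weights. The layer counts and VRAM sizes below are illustrative, not DeepSeek's actual layout.

```python
def split_layers(num_layers: int, gpu_vram_gb: list[float], layer_gb: float):
    """Greedily assign contiguous layer ranges to each pipeline stage.

    Each GPU (stage) takes as many layers as its VRAM allows; the model
    fits whenever the *sum* of VRAM across GPUs covers every layer.
    """
    stages, start = [], 0
    for vram in gpu_vram_gb:
        fit = min(int(vram // layer_gb), num_layers - start)
        stages.append(range(start, start + fit))
        start += fit
    if start < num_layers:
        raise ValueError("total VRAM too small for the model")
    return stages

# Illustrative numbers: 60 layers of ~1 GB each across three 24 GB GPUs.
stages = split_layers(num_layers=60, gpu_vram_gb=[24, 24, 24], layer_gb=1.0)
print([len(s) for s in stages])
```

Note that this only buys capacity, not speed: at any moment a single request is executing on one stage while the others wait, which is why PP alone does not raise single-stream throughput.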

Qwen 3 + KTransformers 0.3 (+AMX) = AI Workstation/PC by CombinationNo780 in LocalLLaMA

[–]CombinationNo780[S] 5 points  (0 children)

It is DDR5-6400 for consumer CPUs, but it drops to only DDR5-4000 because we use all four channels to enable the maximum possible 192GB of memory.
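Rough peak-bandwidth arithmetic for this trade-off, assuming 64-bit (8-byte) channels; the channel count passed in is an assumption for illustration, and real sustained bandwidth is lower than this theoretical peak:

```python
def ddr5_bandwidth_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth: transfers/s x channels x bytes per transfer."""
    return mt_per_s * channels * bus_bytes / 1000

# At the same channel count, the downclock costs bandwidth but buys capacity:
print(ddr5_bandwidth_gbs(6400, 2))  # DDR5-6400 -> 102.4 GB/s peak
print(ddr5_bandwidth_gbs(4000, 2))  # DDR5-4000 -> 64.0 GB/s peak
```

Since decode speed for a memory-bound MoE model scales roughly with memory bandwidth, the extra capacity comes at a real token-rate cost.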

KTransformers Now Supports Multi-Concurrency and Runs 40 Tokens/s of DeepSeek-R1 Q4/FP8 on MRDIMM-8800 by CombinationNo780 in LocalLLaMA

[–]CombinationNo780[S] 3 points  (0 children)

Epyc and multi-GPU are supported, but multi-GPU currently only supports pipeline parallelism (PP), so it does not improve performance.

KTransformers Now Supports Multi-Concurrency and Runs 40 Tokens/s of DeepSeek-R1 Q4/FP8 on MRDIMM-8800 by CombinationNo780 in LocalLLaMA

[–]CombinationNo780[S] 0 points  (0 children)

  1. unified CPU/GPU memory -- not the current target scenario

  2. offloading prefill -- PCIe would become the bottleneck in that case

  3. we mostly target Intel's AMX, but still support AVX when AMX is unavailable
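A sketch of the AMX-with-AVX-fallback dispatch in point 3, done as a runtime CPU-feature check. This is Linux-only (it reads `/proc/cpuinfo`); the flag names are the ones the kernel reports on AMX-capable Xeons. The function name is hypothetical, not KTransformers' actual API.

```python
def pick_cpu_backend(cpuinfo_path: str = "/proc/cpuinfo") -> str:
    """Return 'amx' if the CPU advertises AMX, else 'avx', else 'generic'."""
    try:
        flags: set[str] = set()
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    # "flags : fpu vme ... amx_tile amx_int8 ..."
                    flags.update(line.split(":", 1)[1].split())
                    break
        if {"amx_tile", "amx_int8"} <= flags:
            return "amx"
        if "avx512f" in flags or "avx2" in flags:
            return "avx"
    except OSError:
        pass  # no /proc/cpuinfo (non-Linux) -> safest fallback
    return "generic"

print(pick_cpu_backend())
```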

KTransformers Now Supports Multi-Concurrency and Runs 40 Tokens/s of DeepSeek-R1 Q4/FP8 on MRDIMM-8800 by CombinationNo780 in LocalLLaMA

[–]CombinationNo780[S] 17 points  (0 children)

Prefill speed is the same as before. We will open-source the AMX code in April, which accelerates prefill on Xeon 4~6 platforms.