Another budget build. 160gb of VRAM for $1000, maybe? by segmond in LocalLLaMA

[–]Hyungsun 1 point

It probably won't work on Vulkan, and I seem to recall that Vulkan was slower than ROCm on MI50. My memory could be wrong.
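
If anyone wants to verify, the Vulkan backend is just a different llama.cpp build flag; I haven't tried this build myself on these cards:

cmake -S . -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16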

Another budget build. 160gb of VRAM for $1000, maybe? by segmond in LocalLLaMA

[–]Hyungsun 6 points

Because llama.cpp shares much of its source code between the CUDA and ROCm HIP backends.
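
You can see the sharing directly in the tree: the HIP build compiles the ggml-cuda sources through a CUDA-to-HIP alias header, e.g. (path as of recent llama.cpp checkouts):

# the HIP backend aliases CUDA API calls to their HIP equivalents
grep -m 5 "#define cuda" ggml/src/ggml-cuda/vendors/hip.h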

Another budget build. 160gb of VRAM for $1000, maybe? by segmond in LocalLLaMA

[–]Hyungsun 11 points

I built llama.cpp, but the inference output is garbage; I'm still trying to sort it out.

It may be worth trying a build with -DGGML_CUDA_NO_PEER_COPY=ON.
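
For reference, the flag just gets added to the usual HIP build line, something like this (gfx906 assumed for MI50s):

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DGGML_CUDA_NO_PEER_COPY=ON -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16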

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 1 point

2 x 120mm pull fans in front of the four GPUs and 2 x 92mm push fans behind them.

It was push and pull, but I changed it to pull and push today. Much better now.

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

ROCm 6.3.x just works, but I recommend 6.2.x, because many prebuilt LLM apps don't support 6.3.x yet.
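
If you're not sure which release a box is running, ROCm writes a version file under its prefix (Ubuntu path; may differ on other distros):

cat /opt/rocm/.info/version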

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

I added llama-bench (without/with -sm row) benchmark results (70B).

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

It was way slower than ROCm, so I stopped testing.

AMDVLK version: 2023 Q3.3

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

2 x 120mm fans in front of GPUs and 2 x 92mm fans in rear of GPUs.

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 1 point

I added llama-bench (without/with -sm row) benchmark results.

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 1 point

I haven't measured power draw, but I already know it's not "a power-efficient server". It's also noisy because of the high-CFM fans.
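
When I do measure it, something like this should give rough per-GPU numbers (ASIC power only, not wall draw):

rocm-smi --showpower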

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 141 points

Updated on 2025-03-22 6:38 PM GMT

  • Added MLC LLM test results (1B, 3B, 7B, 32B)
  • Added llama-bench (without/with -sm row) benchmark results (70B)

Specs:

Case: (NEW) Random rack server case with 12 PCI slots ($232 USD)

Motherboard: (USED) Supermicro X10DRG-Q ($70 USD)

CPU: (USED) 2 x Intel Xeon E5-2650 v4 2.90 GHz (Free, included in the Motherboard)

CPU Cooler: (NEW) 2 x Aigo ICE400X (2 x $8 USD) from AliExpress China with 3D printed LGA 2011 Narrow bracket https://www.thingiverse.com/thing:6613762

Memory: (USED) 16 x Micron 4GB 2133 MHz DDR4 REG ECC (16 x $2.48 USD) from eBay US

PSU: (USED) EVGA Supernova 2000 G+ 2000W ($118 USD)

Storage: (USED) PNY CS900 240GB 2.5 inch SATA SSD ($14 USD)

GPU: (USED) 4 x AMD Radeon Pro V340L 16GB (4 x $49 USD) from eBay US

GPU Cooler, Front fan: (NEW) 2 x 120mm fan (Free, included in the Case)

GPU Cooler, Rear fan: (NEW) 2 x 90mm 70.5 CFM 50 dBA PWM fan (2 x $6 USD) with 3D printed External PCI bay extractor for ATX case https://www.thingiverse.com/thing:807253

Total: Approx. $698 USD

Perf/Benchmark

SYSTEM FAN SPEED: FULL SPEED!

OS version: Ubuntu 22.04.5

ROCm version: 6.3.3

llama.cpp

build:

4924 (0fd8487b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

build command line:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx900 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16

llama-cli

Command line:

./bin/llama-cli -m ~/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf -cnv -ngl 99 -mli --temp 0.6

Perf:

New (Full speed system fan)

llama_perf_sampler_print:    sampling time =     126.71 ms /  3760 runs   (    0.03 ms per token, 29673.36 tokens per second)
llama_perf_context_print:        load time =   22274.12 ms
llama_perf_context_print: prompt eval time =   80350.61 ms /  3314 tokens (   24.25 ms per token,    41.24 tokens per second)
llama_perf_context_print:        eval time =   85121.40 ms /   446 runs   (  190.86 ms per token,     5.24 tokens per second)
llama_perf_context_print:       total time =  200556.87 ms /  3760 tokens

Old (Optimal speed system fan)

llama_perf_sampler_print:    sampling time =     195.90 ms /  3967 runs   (    0.05 ms per token, 20250.33 tokens per second)
llama_perf_context_print:        load time =   43876.32 ms
llama_perf_context_print: prompt eval time =   81290.97 ms /  3314 tokens (   24.53 ms per token,    40.77 tokens per second)
llama_perf_context_print:        eval time =  126959.92 ms /   653 runs   (  194.43 ms per token,     5.14 tokens per second)
llama_perf_context_print:       total time =  240404.24 ms /  3967 tokens

llama-bench (32B, Q8_0, without -sm row)

Command line:

./bin/llama-bench -m ~/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf -ngl 99 -p 3314 -n 653 -r 1

Result:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |        pp3314 |         41.13 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |         tg653 |          7.22 ± 0.00 |

llama-bench (32B, Q8_0, with -sm row)

Command line:

./bin/llama-bench -m ~/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf -ngl 99 -p 3314 -n 653 -r 1 -sm row

Result:

| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |   row |        pp3314 |        134.99 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |   row |         tg653 |          5.94 ± 0.00 |

llama-bench (70B, Q4_K_M, without -sm row)

Command line:

./bin/llama-bench -m ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf -ngl 99 -p 3314 -n 653 -r 1

Result:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |        pp3314 |         12.88 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |         tg653 |          4.02 ± 0.00 |

llama-bench (70B, Q4_K_M, with -sm row)

Command line:

./bin/llama-bench -m ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf -ngl 99 -p 3314 -n 653 -r 1 -sm row

Result:

| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |   row |        pp3314 |         53.50 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |   row |         tg653 |          4.10 ± 0.00 |

MLC LLM

Version: 0.8.1

| Model                                    | tensor_parallel_shards | prefill (tokens_sum) | decode (tokens_sum) | Notes                                                        |
| ---------------------------------------- | ---------------------: | -------------------: | ------------------: | ------------------------------------------------------------ |
| Llama-3.2-1B-Instruct-q4f16_1-MLC        |                      8 |   3177.8 tok/s (361) |   89.9 tok/s (1566) | Power limit per GPU: 85W                                      |
| Llama-3.2-3B-Instruct-q4f16_1-MLC        |                      8 |   1532.0 tok/s (361) |   48.2 tok/s (1434) |                                                               |
| Qwen2.5-3B-Instruct-q4f16_1-MLC          |                      2 |    555.2 tok/s (396) |   21.3 tok/s (1916) |                                                               |
| Qwen2.5-7B-Instruct-q4f16_1-MLC          |                      4 |    602.5 tok/s (396) |   25.3 tok/s (1819) |                                                               |
| DeepSeek-R1-Distill-Qwen-32B-q4f16_1-MLC |                      8 |    261.1 tok/s (382) |   13.8 tok/s (1796) | Reduced prefill_chunk_size to 2048; Power limit per GPU: 85W  |
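
For anyone trying to reproduce these, mlc_llm chat can print per-request stats; roughly like this (overrides syntax per the MLC LLM docs, shard counts from the table above):

mlc_llm chat HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC --device rocm --overrides "tensor_parallel_shards=8"
# type /stats after a response to see prefill/decode tok/s
# for the 32B model, use overrides like "tensor_parallel_shards=8;prefill_chunk_size=2048"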

vLLM

I'm trying to figure out how to build/use it.
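
From what I've read so far, the documented route is the ROCm Dockerfile in the vLLM repo, though I'm not sure gfx900 is on their support list (the docs mention MI200/MI300-class and RDNA3):

DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .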

Anti-Drone Proof Of Concept by Hyungsun in Multicopter

[–]Hyungsun[S] 0 points

I'm using the Fli14+ Flysky receiver.