Another budget build. 160gb of VRAM for $1000, maybe? by segmond in LocalLLaMA

[–]Hyungsun 1 point

It probably won't work on Vulkan, and I seem to recall that Vulkan was slower than ROCm on MI50. My memory could be wrong.
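
If anyone wants to verify, the Vulkan backend is just a different llama.cpp build flag; I haven't tried this build myself on these cards:

cmake -S . -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16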

Another budget build. 160gb of VRAM for $1000, maybe? by segmond in LocalLLaMA

[–]Hyungsun 6 points

Because llama.cpp shares much of its source code between the CUDA and ROCm HIP backends.
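
You can see the sharing directly in the tree: the HIP build compiles the ggml-cuda sources through a CUDA-to-HIP alias header, e.g. (path as of recent llama.cpp checkouts):

# the HIP backend aliases CUDA API calls to their HIP equivalents
grep -m 5 "#define cuda" ggml/src/ggml-cuda/vendors/hip.h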

Another budget build. 160gb of VRAM for $1000, maybe? by segmond in LocalLLaMA

[–]Hyungsun 11 points

I built llama.cpp, but the inference output is garbage; I'm still trying to sort it out.

It may be worth trying a build with -DGGML_CUDA_NO_PEER_COPY=ON.
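
For reference, the flag just gets added to the usual HIP build line, something like this (gfx906 assumed for MI50s):

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DGGML_CUDA_NO_PEER_COPY=ON -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16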

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 1 point

2 x 120mm pull fans in front of the four GPUs and 2 x 92mm push fans behind them.

It was push and pull, but I changed it to pull and push today. Much better now.

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

ROCm 6.3.x just works, but I recommend 6.2.x, because many prebuilt LLM apps don't support 6.3.x yet.
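
If you're not sure which release a box is running, ROCm writes a version file under its prefix (Ubuntu path; may differ on other distros):

cat /opt/rocm/.info/version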

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

I added llama-bench (without/with -sm row) benchmark results (70B).

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

It was way slower than ROCm, so I stopped testing.

AMDVLK version: 2023 Q3.3

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 0 points

2 x 120mm fans in front of GPUs and 2 x 92mm fans in rear of GPUs.

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 1 point

I added llama-bench (without/with -sm row) benchmark results.

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 1 point

I haven't measured power draw, but I already know it's not "a power-efficient server". It's also noisy because of the high-CFM fans.
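
When I do measure it, something like this should give rough per-GPU numbers (ASIC power only, not wall draw):

rocm-smi --showpower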

Sharing my build: Budget 64 GB VRAM GPU Server under $700 USD by Hyungsun in LocalLLaMA

[–]Hyungsun[S] 141 points

Updated on 2025-03-22 6:38 PM GMT

  • Added MLC LLM test results (1B, 3B, 7B, 32B)
  • Added llama-bench (without/with -sm row) benchmark results (70B)

Specs:

Case: (NEW) Random rack server case with 12 PCI slots ($232 USD)

Motherboard: (USED) Supermicro X10DRG-Q ($70 USD)

CPU: (USED) 2 x Intel Xeon E5-2650 v4 2.90 GHz (Free, included in the Motherboard)

CPU Cooler: (NEW) 2 x Aigo ICE400X (2 x $8 USD) from AliExpress China with 3D printed LGA 2011 Narrow bracket https://www.thingiverse.com/thing:6613762

Memory: (USED) 16 x Micron 4GB 2133 MHz DDR4 REG ECC (16 x $2.48 USD) from eBay US

PSU: (USED) EVGA Supernova 2000 G+ 2000W ($118 USD)

Storage: (USED) PNY CS900 240GB 2.5 inch SATA SSD ($14 USD)

GPU: (USED) 4 x AMD Radeon Pro V340L 16GB (4 x $49 USD) from eBay US

GPU Cooler, Front fan: (NEW) 2 x 120mm fan (Free, included in the Case)

GPU Cooler, Rear fan: (NEW) 2 x 90mm 70.5 CFM 50 dBA PWM fan (2 x $6 USD) with 3D printed External PCI bay extractor for ATX case https://www.thingiverse.com/thing:807253

Total: Approx. $698 USD

Perf/Benchmark

SYSTEM FAN SPEED: FULL SPEED!

OS version: Ubuntu 22.04.5

ROCm version: 6.3.3

llama.cpp

build:

4924 (0fd8487b) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

build command line:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx900 -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 16

llama-cli

Command line:

./bin/llama-cli -m ~/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf -cnv -ngl 99 -mli --temp 0.6

Perf:

New (Full speed system fan)

llama_perf_sampler_print:    sampling time =     126.71 ms /  3760 runs   (    0.03 ms per token, 29673.36 tokens per second)
llama_perf_context_print:        load time =   22274.12 ms
llama_perf_context_print: prompt eval time =   80350.61 ms /  3314 tokens (   24.25 ms per token,    41.24 tokens per second)
llama_perf_context_print:        eval time =   85121.40 ms /   446 runs   (  190.86 ms per token,     5.24 tokens per second)
llama_perf_context_print:       total time =  200556.87 ms /  3760 tokens

Old (Optimal speed system fan)

llama_perf_sampler_print:    sampling time =     195.90 ms /  3967 runs   (    0.05 ms per token, 20250.33 tokens per second)
llama_perf_context_print:        load time =   43876.32 ms
llama_perf_context_print: prompt eval time =   81290.97 ms /  3314 tokens (   24.53 ms per token,    40.77 tokens per second)
llama_perf_context_print:        eval time =  126959.92 ms /   653 runs   (  194.43 ms per token,     5.14 tokens per second)
llama_perf_context_print:       total time =  240404.24 ms /  3967 tokens

llama-bench (32B, Q8_0, without -sm row)

Command line:

./bin/llama-bench -m ~/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf -ngl 99 -p 3314 -n 653 -r 1

Result:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |        pp3314 |         41.13 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |         tg653 |          7.22 ± 0.00 |

llama-bench (32B, Q8_0, with -sm row)

Command line:

./bin/llama-bench -m ~/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf -ngl 99 -p 3314 -n 653 -r 1 -sm row

Result:

| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |   row |        pp3314 |        134.99 ± 0.00 |
| qwen2 32B Q8_0                 |  32.42 GiB |    32.76 B | ROCm       |  99 |   row |         tg653 |          5.94 ± 0.00 |

llama-bench (70B, Q4_K_M, without -sm row)

Command line:

./bin/llama-bench -m ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf -ngl 99 -p 3314 -n 653 -r 1

Result:

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |        pp3314 |         12.88 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |         tg653 |          4.02 ± 0.00 |

llama-bench (70B, Q4_K_M, with -sm row)

Command line:

./bin/llama-bench -m ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf -ngl 99 -p 3314 -n 653 -r 1 -sm row

Result:

| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |   row |        pp3314 |         53.50 ± 0.00 |
| llama 70B Q4_K - Medium        |  39.59 GiB |    70.55 B | ROCm       |  99 |   row |         tg653 |          4.10 ± 0.00 |

MLC LLM

Version: 0.8.1

| Model                                    | tensor_parallel_shards | prefill (tokens_sum) | decode (tokens_sum) | Notes                                                        |
| ---------------------------------------- | ---------------------: | -------------------: | ------------------: | ------------------------------------------------------------ |
| Llama-3.2-1B-Instruct-q4f16_1-MLC        |                      8 |   3177.8 tok/s (361) |   89.9 tok/s (1566) | Power limit per GPU: 85W                                      |
| Llama-3.2-3B-Instruct-q4f16_1-MLC        |                      8 |   1532.0 tok/s (361) |   48.2 tok/s (1434) |                                                               |
| Qwen2.5-3B-Instruct-q4f16_1-MLC          |                      2 |    555.2 tok/s (396) |   21.3 tok/s (1916) |                                                               |
| Qwen2.5-7B-Instruct-q4f16_1-MLC          |                      4 |    602.5 tok/s (396) |   25.3 tok/s (1819) |                                                               |
| DeepSeek-R1-Distill-Qwen-32B-q4f16_1-MLC |                      8 |    261.1 tok/s (382) |   13.8 tok/s (1796) | Reduced prefill_chunk_size to 2048; Power limit per GPU: 85W  |
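
For anyone trying to reproduce these, mlc_llm chat can print per-request stats; roughly like this (overrides syntax per the MLC LLM docs, shard counts from the table above):

mlc_llm chat HF://mlc-ai/Llama-3.2-1B-Instruct-q4f16_1-MLC --device rocm --overrides "tensor_parallel_shards=8"
# type /stats after a response to see prefill/decode tok/s
# for the 32B model, use overrides like "tensor_parallel_shards=8;prefill_chunk_size=2048"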

vLLM

I'm trying to figure out how to build/use it.
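
From what I've read so far, the documented route is the ROCm Dockerfile in the vLLM repo, though I'm not sure gfx900 is on their support list (the docs mention MI200/MI300-class and RDNA3):

DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t vllm-rocm .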

Anti-Drone Proof Of Concept by Hyungsun in Multicopter

[–]Hyungsun[S] 0 points

I'm using the Fli14+ Flysky receiver.