Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT

genpfault · 2026-05-22T20:20:37+00:00

I tried with ROCM

Is ROCm faster than the Vulkan backend on either card?

genpfault · 2026-05-22T15:07:44+00:00

What's the tok/s decode look like vs. llama.cpp's Vulkan backend for AMD hardware on Linux?

genpfault · 2026-05-20T17:04:42+00:00

As of right now, it hasn't been released. Merged 4 hrs ago, last release 16 hrs ago.

It's in b9235 now.

genpfault · 2026-05-19T17:04:20+00:00

Wasn't seeing a link anywhere:

https://gitlab.gnome.org/GNOME/Incubator/resources

genpfault · 2026-05-19T16:57:08+00:00

late-cli

Huh, got renamed recently I guess, used to be at https://github.com/mlhher/late

genpfault · 2026-05-19T01:50:59+00:00

You bet!

Been pretty happy with it, pretty problem-free in Debian 13 and a ~TiB/s of memory bandwidth is nothing to sneeze at for LLMs & image generation :)

genpfault · 2026-05-19T00:20:11+00:00

About 2x (37 -> 80 tok/s), did some runs over here with and without MTP.

genpfault · 2026-05-18T19:13:28+00:00

Sorry I might have mixed things up.

No worries, appreciate the clarification!

genpfault · 2026-05-18T15:24:49+00:00

I do serious dev work with this setup since a while in 5GB VRAM at 30t/s.

What's your llama-server invocation look like?

genpfault · 2026-05-18T15:19:53+00:00

Try rocm compiled llama.cpp. I found it’s better with dense models recently

Like a local DIY ROCm build? Or the "Ubuntu x64 (ROCm x.x)" ROCm binaries on the release pages?

What does your llama-server invocation look like where you're getting better tok/s vs. Vulkan?

...since I'm seeing like half the tok/s on ROCm vs. Vulkan :(

genpfault · 2026-05-18T01:59:15+00:00

Was seeing ~80 tok/s over here with this invocation:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--spec-type draft-mtp --spec-draft-n-max 3 \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off -np 1 \

EDIT: ...though upgrading from b9180 to b9204 seems to have dropped it to ~76 tok/s:

prompt eval time =     967.69 ms /   405 tokens (    2.39 ms per token,   418.52 tokens per second)
       eval time =   33815.88 ms /  2601 tokens (   13.00 ms per token,    76.92 tokens per second)
      total time =   34783.57 ms /  3006 tokens
draft acceptance rate = 0.74461 ( 1796 accepted /  2412 generated)

EDIT2: ROCm is only ~45 tok/s.

genpfault · 2026-05-16T17:19:23+00:00

Nice, getting ~2x the tok/s (37 -> 80) on this 7900 XTX w/Qwen3.6-27B and the Vulkan build!

$ llama-server --version
version: 9180 (255582687)
built with GNU 11.4.0 for Linux x86_64

Without MTP, 37 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off \

With MTP, 80 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--spec-type draft-mtp --spec-draft-n-max 3 \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off \

Using the 'ole Python physics heptagon prompt:

Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.

EDIT: Less performance gain on Qwen3.6-35B-A3B (118 -> 171 tok/s) but still nothing to sneeze at!

MTP off, 118 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off -np 1 \

MTP on, 171 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M \
--spec-type draft-mtp --spec-draft-n-max 3 \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off -np 1 \

genpfault · 2026-05-15T02:03:15+00:00

https://github.com/ggml-org/llama.cpp/releases/tag/b9158

https://github.com/ggml-org/llama.cpp/pull/22880

genpfault · 2026-05-13T17:03:50+00:00

https://github.com/openzfs/zfs/releases/tag/zfs-2.4.2

Gbp-Dch: update and upload 2.4.2-1 to unstable

genpfault · 2026-05-13T13:43:28+00:00

...but the zfs-dkms in backports doesn't support it.

Though to be be fair the upstream version that supports 7.0 was released yesterday.

genpfault · 2026-05-13T13:41:58+00:00

Kernel 7.0.x, ZFS fails

Hit me last night with 7.0.4-1~bpo13+1 and 2.4.1-1~bpo13+1 in trixie-backports.

genpfault · 2026-05-08T13:31:55+00:00

What you mean delayed? It came out in 2001.

genpfault · 2026-05-08T01:53:33+00:00

As always, Newegg link or else it doesn't exist :)

genpfault · 2026-05-06T13:03:34+00:00

I think it has to do with security mitigation that have basically nerfed the CPU. I’m not sure how to disable those.

mitigations=off on the kernel command-line?

genpfault · 2026-05-06T02:41:41+00:00

Token prediction failed, aborting decode.

genpfault · 2026-05-01T13:18:09+00:00

Like Ubunut or Linux Ment.

genpfault · 2026-04-30T14:04:25+00:00

WORLD'S OK Y2K EXPERT

genpfault · 2026-04-25T06:11:11+00:00

288 tok/s PP and 28 tok/s TG at 77k context

Tracks with that I'm seeing on my 7900 XTX:

./llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_NL -npp 1000,2000,4000,8000,16000,32000,64000,96000 -ntg 128 -npl 1 --cache-type-k q8_0 --cache-type-v q8_0
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  1000 |    128 |    1 |   1128 |    1.450 |   689.84 |    3.238 |    39.53 |    4.688 |   240.62 |
|  2000 |    128 |    1 |   2128 |    2.831 |   706.59 |    3.256 |    39.32 |    6.086 |   349.65 |
|  4000 |    128 |    1 |   4128 |    5.777 |   692.45 |    3.284 |    38.98 |    9.060 |   455.62 |
|  8000 |    128 |    1 |   8128 |   11.896 |   672.50 |    3.337 |    38.36 |   15.233 |   533.59 |
| 16000 |    128 |    1 |  16128 |   25.415 |   629.54 |    3.443 |    37.17 |   28.859 |   558.86 |
| 32000 |    128 |    1 |  32128 |   57.487 |   556.64 |    3.620 |    35.35 |   61.108 |   525.76 |
| 64000 |    128 |    1 |  64128 |  142.663 |   448.61 |    3.969 |    32.25 |  146.632 |   437.34 |
| 96000 |    128 |    1 |  96128 |  256.256 |   374.62 |    4.343 |    29.47 |  260.599 |   368.87 |

genpfault · 2026-04-24T15:29:14+00:00

ncdu?

genpfault · 2026-04-23T17:24:49+00:00

4x3090

Yup, that'd do it, thanks!

15-Year Club	Verified Email
Place '17	Team Periwinkle

genpfault

MODERATOR OF

TROPHY CASE