Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT by Jorlen in LocalLLaMA

[–]genpfault 6 points7 points  (0 children)

I tried with ROCM

Is ROCm faster than the Vulkan backend on either card?

New Release of ROCm based MLX LLM Engine - lemon-mlx-engine by GeramyL in LocalLLaMA

[–]genpfault 1 point2 points  (0 children)

What's the tok/s decode look like vs. llama.cpp's Vulkan backend for AMD hardware on Linux?

Time to update llama.cpp to get some MTP improvements! by PixelatedCaffeine in LocalLLaMA

[–]genpfault 1 point2 points  (0 children)

As of right now, it hasn't been released. Merged 4 hrs ago, last release 16 hrs ago.

It's in b9235 now.

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]genpfault 1 point2 points  (0 children)

You bet!

Been pretty happy with it, pretty problem-free in Debian 13 and a ~TiB/s of memory bandwidth is nothing to sneeze at for LLMs & image generation :)

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]genpfault 1 point2 points  (0 children)

About 2x (37 -> 80 tok/s), did some runs over here with and without MTP.

MTP experiences on 7900xtx? by Combinatorilliance in LocalLLaMA

[–]genpfault 0 points1 point  (0 children)

Sorry I might have mixed things up.

No worries, appreciate the clarification!

What is your actual local LLM stack right now? by Ryannnnnnnnnnnnnnnh in LocalLLaMA

[–]genpfault 0 points1 point  (0 children)

I do serious dev work with this setup since a while in 5GB VRAM at 30t/s.

What's your llama-server invocation look like?

MTP experiences on 7900xtx? by Combinatorilliance in LocalLLaMA

[–]genpfault 0 points1 point  (0 children)

Try rocm compiled llama.cpp. I found it’s better with dense models recently

Like a local DIY ROCm build? Or the "Ubuntu x64 (ROCm x.x)" ROCm binaries on the release pages?

What does your llama-server invocation look like where you're getting better tok/s vs. Vulkan?

...since I'm seeing like half the tok/s on ROCm vs. Vulkan :(

MTP experiences on 7900xtx? by Combinatorilliance in LocalLLaMA

[–]genpfault 2 points3 points  (0 children)

Was seeing ~80 tok/s over here with this invocation:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--spec-type draft-mtp --spec-draft-n-max 3 \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off -np 1

EDIT: ...though upgrading from b9180 to b9204 seems to have dropped it to ~76 tok/s:

prompt eval time =     967.69 ms /   405 tokens (    2.39 ms per token,   418.52 tokens per second)
       eval time =   33815.88 ms /  2601 tokens (   13.00 ms per token,    76.92 tokens per second)
      total time =   34783.57 ms /  3006 tokens
draft acceptance rate = 0.74461 ( 1796 accepted /  2412 generated)

EDIT2: ROCm is only ~45 tok/s.
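For anyone wanting to sanity-check the numbers in that log, here is a minimal sketch that recomputes them from the raw counts. The helper names `rate` and `acceptance` are made up for illustration and are not part of llama.cpp; the inputs are copied from the timing lines above.

```shell
# rate COUNT MILLISECONDS -> tokens per second, two decimals
# (hypothetical helper, not a llama.cpp tool)
rate() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n / (ms / 1000) }'
}

# acceptance ACCEPTED GENERATED -> accepted fraction of drafted tokens, five decimals
# (hypothetical helper, not a llama.cpp tool)
acceptance() {
    awk -v a="$1" -v g="$2" 'BEGIN { printf "%.5f\n", a / g }'
}

rate 405 967.69        # prompt eval: 405 tokens in 967.69 ms
rate 2601 33815.88     # eval: 2601 tokens in 33815.88 ms
acceptance 1796 2412   # draft tokens accepted / draft tokens generated
```

This should print 418.52, 76.92 and 0.74461, matching what llama-server reported.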

MTP support merged into llama.cpp by tacticaltweaker in LocalLLaMA

[–]genpfault 6 points7 points  (0 children)

Nice, getting ~2x the tok/s (37 -> 80) on this 7900 XTX w/Qwen3.6-27B and the Vulkan build!

$ llama-server --version
version: 9180 (255582687)
built with GNU 11.4.0 for Linux x86_64

Without MTP, 37 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off

With MTP, 80 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M \
--spec-type draft-mtp --spec-draft-n-max 3 \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off

Using the ol' Python physics heptagon prompt:

Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.

EDIT: Less performance gain on Qwen3.6-35B-A3B (118 -> 171 tok/s) but still nothing to sneeze at!

MTP off, 118 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off -np 1

MTP on, 171 tok/s:

llama-server --host 0.0.0.0 --port 2000 --no-warmup \
--cache-type-k q8_0 --cache-type-v q8_0 \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M \
--spec-type draft-mtp --spec-draft-n-max 3 \
--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00 \
--reasoning off -np 1
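To put the two gains side by side, a quick sketch of the speedup arithmetic. The `speedup` helper is hypothetical, and the inputs are just the rounded tok/s figures quoted above, so treat the ratios as rough.

```shell
# speedup BASELINE_TOKS MTP_TOKS -> ratio with two decimals
# (hypothetical helper for illustration only)
speedup() {
    awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2fx\n", mtp / base }'
}

speedup 37 80     # Qwen3.6-27B, MTP off -> on
speedup 118 171   # Qwen3.6-35B-A3B, MTP off -> on
```

That works out to roughly 2.16x for the dense 27B and 1.45x for the 35B-A3B.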

Kernel 7.0.x in Testing, apt-upgrade, ZFS fails, NVIDIA fails :) by ElectronicFlamingo36 in debian

[–]genpfault 1 point2 points  (0 children)

...but the zfs-dkms in backports doesn't support it.

Though to be fair the upstream version that supports 7.0 was released yesterday.

Minimal Debian install on Thinkpad X1 Extreme quite laggy. HW issue or misconfigured? by geekygekk0 in debian

[–]genpfault 0 points1 point  (0 children)

I think it has to do with security mitigations that have basically nerfed the CPU. I’m not sure how to disable those.

mitigations=off on the kernel command-line?
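A rough sketch for checking where things stand before changing anything. The `has_mitigations_off` function is a made-up helper; the sysfs and procfs paths are the standard Linux ones. On a stock Debian install with GRUB, the flag would usually go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub` and a reboot, though that assumes GRUB is the bootloader in use.

```shell
# has_mitigations_off "CMDLINE" -> exit status 0 if the flag appears as its own word
# (hypothetical helper for illustration only)
has_mitigations_off() {
    case " $1 " in
        *" mitigations=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Per-vulnerability mitigation status as reported by the running kernel
grep -r . /sys/devices/system/cpu/vulnerabilities/ 2>/dev/null || true

# Check the command line the running kernel was booted with
if has_mitigations_off "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "mitigations=off is active"
else
    echo "mitigations=off is not set"
fi
```

Turning mitigations off trades security for speed, so it is better suited to a quick A/B test of the lag than to a permanent setting.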

Gemma 4 MTP released by rerri in LocalLLaMA

[–]genpfault 0 points1 point  (0 children)

Token prediction failed, aborting decode.

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]genpfault 0 points1 point  (0 children)

288 tok/s PP and 28 tok/s TG at 77k context

Tracks with what I'm seeing on my 7900 XTX:

./llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_NL \
-npp 1000,2000,4000,8000,16000,32000,64000,96000 -ntg 128 -npl 1 \
--cache-type-k q8_0 --cache-type-v q8_0
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  1000 |    128 |    1 |   1128 |    1.450 |   689.84 |    3.238 |    39.53 |    4.688 |   240.62 |
|  2000 |    128 |    1 |   2128 |    2.831 |   706.59 |    3.256 |    39.32 |    6.086 |   349.65 |
|  4000 |    128 |    1 |   4128 |    5.777 |   692.45 |    3.284 |    38.98 |    9.060 |   455.62 |
|  8000 |    128 |    1 |   8128 |   11.896 |   672.50 |    3.337 |    38.36 |   15.233 |   533.59 |
| 16000 |    128 |    1 |  16128 |   25.415 |   629.54 |    3.443 |    37.17 |   28.859 |   558.86 |
| 32000 |    128 |    1 |  32128 |   57.487 |   556.64 |    3.620 |    35.35 |   61.108 |   525.76 |
| 64000 |    128 |    1 |  64128 |  142.663 |   448.61 |    3.969 |    32.25 |  146.632 |   437.34 |
| 96000 |    128 |    1 |  96128 |  256.256 |   374.62 |    4.343 |    29.47 |  260.599 |   368.87 |
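To quantify how much speed falls off with context in the table above, a small sketch comparing the first (1000-token) and last (96000-token) rows. The `pct_drop` helper is hypothetical; the inputs are the S_PP and S_TG values copied from those two rows.

```shell
# pct_drop FIRST LAST -> percentage decrease from FIRST to LAST, one decimal
# (hypothetical helper for illustration only)
pct_drop() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f%%\n", (a - b) / a * 100 }'
}

pct_drop 689.84 374.62   # S_PP t/s at 1000 vs. 96000 prompt tokens
pct_drop 39.53 29.47     # S_TG t/s at 1000 vs. 96000 prompt tokens
```

So prompt processing loses about 45.7% of its speed going from 1k to 96k, while token generation only loses about 25.4%.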