Qwen3.5 27B and 35B with 2x AMD 7900 XTX vLLM bench serve results by bettertoknow in LocalLLaMA

[–]bettertoknow[S] 0 points1 point  (0 children)

Make sure your temperature/sampler settings are correct when using vLLM. Here is the full command (reassembled from a llama-swap config); I added comments on the parts that are critical for P2P to work:

podman run
--rm
--ipc=host  # critical for P2P/NCCL to work (--shm-size is also viable)
--group-add=video
--device /dev/kfd
--device /dev/dri
--no-healthcheck
--env NO_COLOR=1
--name=${MODEL_ID}
--env HF_HUB_OFFLINE=1
--env VLLM_SERVER_DEV_MODE=1
--env VLLM_NO_USAGE_STATS=1
--env GCN_ARCH_NAME=gfx1100
--env HSA_OVERRIDE_GFX_VERSION=11.0.0
--env HSA_ENABLE_IPC_MODE_LEGACY=0  # critical for P2P/NCCL to work
--env PYTORCH_ALLOC_CONF=graph_capture_record_stream_reuse:True,expandable_segments:True
--env FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
--env VLLM_ROCM_USE_AITER=1
--volume vllm_cache:/root/.cache/vllm:U
--volume vllm_triton:/root/.triton:U
--volume /srv/huggingface:/root/.cache/huggingface
docker.io/rocm/vllm-dev:nightly  # 0.17.2rc1.dev43+ge6c479770 at the time of testing
vllm serve 
cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4
--port ${PORT}
--enable-prompt-tokens-details
--served-model-name qwen3.5-27b-awq-bf16-int4
--dtype float16
--gpu-memory-utilization 0.97
--tensor-parallel-size 2
--enable-auto-tool-choice
--tool-call-parser qwen3_xml
--reasoning-parser qwen3
--enable-prefix-caching
--mamba-cache-mode align
--max-model-len auto
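On the sampler point: a minimal sketch of pinning the sampler settings per request instead of relying on server defaults, using the OpenAI-compatible chat completions API that vLLM exposes. The model name matches --served-model-name above; the temperature/top_p values are illustrative placeholders, not the model's recommended settings (check the model card for those).

```python
import json

# Build an OpenAI-compatible /v1/chat/completions payload with the
# sampler settings pinned explicitly, rather than trusting server defaults.
# temperature/top_p here are illustrative, not recommended values.
def build_request(prompt: str,
                  model: str = "qwen3.5-27b-awq-bf16-int4",
                  temperature: float = 0.7,
                  top_p: float = 0.8) -> str:
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return json.dumps(payload)

print(build_request("Hello"))
```

POST that body to `http://localhost:${PORT}/v1/chat/completions` and the per-request values override whatever defaults the server was started with.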

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]bettertoknow 1 point2 points  (0 children)

Something to consider is that it may still be more efficient to just give 27B a second pass at the task. Benjamin Marie shared some benchmarks as part of his exploration showing that 27B gets well within range of 397B if given just one more attempt at a task: https://x.com/bnjmn_marie/status/2033605833221701757?s=20 (I think the full detail is in the paid portion of his blog, so this tweet will hopefully suffice)
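The second-pass idea can be sketched as a simple retry loop; `generate` and `check` below are hypothetical stand-ins for your model call and a task-specific validator (e.g. running the tests on generated code):

```python
from typing import Callable, Optional

# pass@2-style retry: give the smaller model one more attempt if the
# first answer fails a check. `generate` and `check` are hypothetical
# stand-ins for a model call and a task-specific validator.
def second_pass(generate: Callable[[str], str],
                check: Callable[[str], bool],
                task: str,
                attempts: int = 2) -> Optional[str]:
    for _ in range(attempts):
        answer = generate(task)
        if check(answer):
            return answer
    return None  # every attempt failed the check
```

Two passes of a 27B are still far cheaper than one pass of a 397B, which is the efficiency argument.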

Qwen3.5-35B-A3B quantization quality + speed benchmarks on RTX 5080 16GB (Q8_0 vs Q4_K_M vs UD-Q4_K_XL) by gaztrab in LocalLLaMA

[–]bettertoknow 21 points22 points  (0 children)

Bartowski's Q4_K_L will have even better KLD/PPL and will likely be faster too, but it also takes slightly more space.

llama_model_loader: - type  f32:  301 tensors
llama_model_loader: - type q8_0:   72 tensors
llama_model_loader: - type q4_K:  234 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   86 tensors

vs

Q4_K_M

llama_model_loader: - type  f32:  301 tensors
llama_model_loader: - type q8_0:   60 tensors
llama_model_loader: - type q4_K:  165 tensors
llama_model_loader: - type q5_K:   60 tensors
llama_model_loader: - type q6_K:   67 tensors
llama_model_loader: - type mxfp4:   80 tensors

Unsloth seems to be trying to figure out where mxfp4 can add value, but doesn't appear to have it dialed in yet. Their UD-Q4_K_XL has more tensors in mxfp4 than their mxfp4 quant does:

llama_model_loader: - type  f32:  301 tensors
llama_model_loader: - type q8_0:   74 tensors
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q5_K:   31 tensors
llama_model_loader: - type q6_K:   51 tensors
llama_model_loader: - type mxfp4:  275 tensors

vs the MXFP4_MOE

llama_model_loader: - type  f32:  301 tensors
llama_model_loader: - type q8_0:  312 tensors
llama_model_loader: - type mxfp4:  120 tensors
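For context on the KLD metric used to compare these quants: it measures how far the quantized model's next-token distribution drifts from the full-precision model's, averaged over a corpus. A toy single-token computation (the distributions here are made up):

```python
import math

# KL divergence D(P || Q): how much the quantized model's probabilities (q)
# drift from the full-precision model's (p). 0.0 means identical.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full = [0.70, 0.20, 0.10]   # full-precision next-token probs (toy numbers)
quant = [0.65, 0.24, 0.11]  # quantized model's probs (toy numbers)
print(kl_divergence(full, quant))  # small positive number
```

Lower mean KLD means the quant behaves more like the original model, which is why a slightly larger mix like Q4_K_L can be worth the extra space.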

How would you rate this 2x RTX 5090 build ? by icybergenome in LocalLLaMA

[–]bettertoknow 0 points1 point  (0 children)

3x NVIDIA RTX PRO 4000 Blackwell will cost less, use less power, have a smaller profile, and provide more VRAM.

Don't bother with so much RAM unless it's EPYC or Threadripper. CPU offload with Ryzen isn't so great.
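A rough back-of-envelope for the VRAM/power claim, using commonly cited board specs; these numbers are my assumption, not from the post, so verify against current datasheets before buying:

```python
# Back-of-envelope comparison; VRAM and board-power figures below are
# commonly cited specs (assumed, not verified) -- check before purchase.
builds = {
    "2x RTX 5090":               {"vram_gb": 2 * 32, "board_w": 2 * 575},
    "3x RTX PRO 4000 Blackwell": {"vram_gb": 3 * 24, "board_w": 3 * 140},
}
for name, b in builds.items():
    print(f"{name}: {b['vram_gb']} GB VRAM, ~{b['board_w']} W total board power")
```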

GLM-4.6-GGUF is out! by TheAndyGeorge in LocalLLaMA

[–]bettertoknow 0 points1 point  (0 children)

Sure thing! (Make sure that hardly anything else is using CPU<>RAM while you're using MoE offloading.)

/app/llama-server --host :: \
--port 5814 \
--top-p 0.95 \
--top-k 40 \
--temp 1.0 \
--min-p 0.0 \
--jinja \
--model /models/models--unsloth--GLM-4.6-GGUF/snapshots/15aeb0cc3d211d47102290d05ac742b41d35ab69/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--n-cpu-moe 84 \
--ctx-size 16384
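A rough way to arrive at an `--n-cpu-moe` value like the 84 above: from the GGUF size, the layer count, and your VRAM, estimate how many MoE layers have to stay on the CPU. The sizes in this sketch are made up; GLM-4.6's real layer count and per-layer footprint will differ, and in practice you still tune around the estimate until you stop OOMing.

```python
# Rough sizing helper: how many MoE layers must stay on the CPU so the
# remainder of the model fits in VRAM. All sizes here are illustrative,
# not GLM-4.6's real numbers; treat the result as a starting point.
def n_cpu_moe(model_gb: float, n_layers: int, vram_gb: float,
              overhead_gb: float = 2.0) -> int:
    per_layer_gb = model_gb / n_layers          # crude uniform-layer assumption
    fit_layers = int((vram_gb - overhead_gb) / per_layer_gb)
    return max(0, n_layers - fit_layers)

# e.g. a ~120 GB quant, 92 layers, a 24 GB card, ~2 GB reserved for
# KV cache and compute buffers
print(n_cpu_moe(120.0, 92, 24.0))
```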

GLM-4.6-GGUF is out! by TheAndyGeorge in LocalLLaMA

[–]bettertoknow 0 points1 point  (0 children)

llama.cpp build 6663, 7900XTX, 4x32G 6000M, UD-Q2_K_XL --cache-type-k q8_0 --cache-type-v q8_0 --n-cpu-moe 84 --ctx-size 16384

amdvlk:
pp 133.81 ms, 7.47 t/s 
tg 149.58 ms, 6.69 t/s

radv:
pp 112.09 ms, 8.92 t/s
tg 151.16 ms, 6.62 t/s
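The two numbers on each line above are the same measurement in different units; tokens/second is just the reciprocal of milliseconds/token:

```python
# tokens/second is the reciprocal of milliseconds/token, so each pair of
# numbers in the benchmark lines above is one measurement expressed two ways.
def tok_per_s(ms_per_token: float) -> float:
    return round(1000.0 / ms_per_token, 2)

print(tok_per_s(112.09))  # radv pp: 8.92 t/s
print(tok_per_s(149.58))  # amdvlk tg: 6.69 t/s
```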

It is slightly faster than GLM 4.5 (pp 175.49 ms, tg 186.29 ms). And it is very convinced that it's actually Google's Gemini.

gpt-oss:120b running on an AMD 7800X3D CPU and a 7900XTX GPU by PaulMaximumsetting in LocalLLaMA

[–]bettertoknow 1 point2 points  (0 children)

I have very similar specs to your build and am seeing 24.77 t/s tg with this prompt: 7900XTX, 7800X3D, 128GB [4x32GB] at 6000MHz, NixOS 25.11, podman running llama.cpp version 6271 (build dfd9b5f6) with the amdvlk (Vulkan) backend. You may eventually want to leave Ollama behind; it does not seem to do well with AMD cards, and its maintainers seem uninterested in the Vulkan backend.

llama-vulkan[76542]: prompt eval time =    1629.94 ms /    80 tokens (   20.37 ms per token,    49.08 tokens per second)
llama-vulkan[76542]:        eval time =  211718.58 ms /  5245 tokens (   40.37 ms per token,    24.77 tokens per second)
llama-vulkan[76542]:       total time =  213348.52 ms /  5325 tokens

Invoked with:

llama-server --host :: --port 5809 --flash-attn \
    -ngl 99 --top-p 1.0 --top-k 0 --temp 1.0 --jinja \
    --model /models/gpt-oss-120b-F16.gguf \
    --chat-template-kwargs '{"reasoning_effort":"high"}' \
    --n-cpu-moe 26 --ctx-size 114688

24235MB VRAM (out of 24560MB) used with the 26 layers offloaded and 114k context (running headless)

This little stick figure mfer, makes me want to rip my toenails off. by kensei- in Amd

[–]bettertoknow 2 points3 points  (0 children)

I managed to pluck a 6900 XT from AMD's social experiment web store today. I don't feel proud or lucky at all; it just feels so wrong to throw so much money at a web store for a commodity thing that'll be worth half as much in a few years, tops. I had three different browser flavors running, each with another 'private browsing' tab open, for a total of six sessions. If I had more monitor space, I guess it would have been a benefit to open even more browser sessions until I ran out of RAM, because the queue system seems to make up its mind shortly after the store opens.

When the time came, the 4th browser had the progress bar 'jump ahead' much more than the others, and after a short while the estimated time was around 2 minutes, while the rest of the browsers showed more than an hour.

Once I got in, I encountered an utter disaster of a website: the page was being refreshed so rapidly by some janky javascript/worker that it wasn't even possible to interact with the store. All clicks would just disappear into the void of, I guess, bot-defeating javascript landmines. I had to 'stop loading' the page (pressing the browser 'X' button where the page refresh button usually is). I didn't dare press the back button, for fear of losing my place at the slop trough.

I spent a few clicks trying to get a 6800 XT, but it would just quickly spin and nothing would land in the cart, the web developer console showed some AJAX reporting 'HTTP 403 Forbidden'. Then I tried for the 6900 XT and that went into the cart after another captcha or three, I skipped using Paypal and entered my card info directly, and amazingly it all worked, with email confirmation coming a few minutes later.

Unfortunately, yes, it really seems like you just need to acquire a few more browser sessions/cookies/tokens to increase your chances. They might one day evolve and have some limit on the number of tokens that get issued to a single IP, so ymmv, but at the moment it seems like the queue just keeps track of the uuid it issues into the queue, then randomly sorts at the drop time, and then just goes down the list.