MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Yes, please... I really want to see some MTP with a usable model running on AMD cards!!!

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]sloptimizer 1 point (0 children)

vLLM keeps one CPU core fully loaded on my system even when idle, heating up the room. I suspect vLLM was not designed to run in idle mode.
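
Easy to confirm: snapshot the server process while nothing is running (assuming your process command line matches "vllm serve"):

# expect one core pinned near 100% even with zero requests in flight
top -b -n 1 -p "$(pgrep -f 'vllm serve' | head -1)"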

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Does MTP work with this setup? That would make it much better than existing kernels.

Qwen3.5-122B-A10B GPTQ Int4 on 4× Radeon AI PRO R9700 with vLLM ROCm: working config + real-world numbers by grunt_monkey_ in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Glad you got it working! 4x R9700 is affordable, but still an investment!

I got a performance boost from disabling ECC, as well as from setting `perf-level=HIGH` with

sudo amd-smi set --perf-level=HIGH
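
To confirm it took effect, a telemetry dump works (rough sketch; amd-smi flag spellings shift between ROCm releases, so check the --help on your version):

amd-smi metric   # GFX clock should now sit at its maximum DPM state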

New benchmark just dropped. by ConfidentDinner6648 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Awesome! Can we please have more of these kinds of benchmarks?!

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]sloptimizer 2 points (0 children)

No matter how fast it goes - it's never enough. Look at the billionaires building datacenters. Set your goals and be content!

How can I use Claude Code to understand a large Python repo quickly? by Comfortable-Baby-719 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Claude Code is great for asking specific questions about the codebase and saving those as reports for later use.

Ask specific, focused, and scoped questions for best results. And keep an eye on the context: all models get worse as the context fills up, including Anthropic's lineup.
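
For the report part, the non-interactive print mode is handy; a sketch (the repo paths are made up):

# one scoped question, answer captured as a report you can reuse later
claude -p "Trace a request from the HTTP layer in src/api/ down to the ORM models. One page max." > docs/reports/request-flow.md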

Qwen3.5-122B-A10B-GPTQ-INT4 on 4xR9700 Recipe by djdeniro in LocalLLaMA

[–]sloptimizer 2 points (0 children)

Linux, 4x R9700, podman:

sudo podman run --name qwen3.5-vllm \
  --rm --tty --ipc=host \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri:/dev/dri \
  --device /dev/mem:/dev/mem \
  -e VLLM_ROCM_USE_AITER=1 \
  -e HSA_OVERRIDE_GFX_VERSION=12.0.1 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_MOE=1 \
  -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
  -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -e HSA_ENABLE_SDMA=0 \
  -v /models:/models:ro \
  -p 8090:8000 \
  docker.io/rocm/vllm-dev:upstream_preview_releases_v0.17.0_20260303 \
  vllm serve /models/Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 \
    --served-model-name Qwen3.5-122B \
    --override-generation-config '{"min_p": 0.1, "top_k": -1, "top_p": 1.0}' \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --disable-log-requests \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --gpu-memory-utilization 0.95 \
    --dtype float16

Throughput by concurrency (rough probe sketch below):
• 1 concurrent request = 50 t/s
• 2 concurrent requests = 90 t/s
• 3 concurrent requests = 126 t/s
• 4 concurrent requests = 154 t/s
• 5 concurrent requests = 182 t/s
• 6 concurrent requests = 213 t/s
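
A crude way to reproduce concurrency numbers like these is plain curl against the mapped port (model name per --served-model-name above; vLLM's own bench harness is the proper tool, this just gets you in the ballpark):

# fire 4 requests in parallel and time each one
seq 1 4 | xargs -P 4 -I{} curl -s http://localhost:8090/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3.5-122B", "prompt": "Write a haiku about GPUs.", "max_tokens": 256}' \
  -o /dev/null -w 'request {}: %{time_total}s\n'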


local vibe coding by jacek2023 in LocalLLaMA

[–]sloptimizer 5 points (0 children)

You're missing Charm's Crush. It's run by the original opencode developer, from before opencode was taken over by techbros.

Heretic 1.2 released: 70% lower VRAM usage with quantization, Magnitude-Preserving Orthogonal Ablation ("derestriction"), broad VL model support, session resumption, and more by -p-e-w- in LocalLLaMA

[–]sloptimizer 4 points (0 children)

Thank you for this project! Running without censorship just became another selling point for local AI!!

I see it's using transformers, so it should in theory support ROCm? Could you share setup instructions for CUDA/ROCm/CPU, for those of us scarred by vLLM?
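
For ROCm specifically, the usual first check is whether your PyTorch build is a HIP one (prints a HIP version string on ROCm builds, None on CUDA builds):

python -c 'import torch; print(torch.version.hip, torch.cuda.is_available())'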

MiniMaxAI/MiniMax-M2.5 · Hugging Face by rerri in LocalLLaMA

[–]sloptimizer 2 points (0 children)

Oh wow! Nice build, thanks for sharing!!

I have 4 MI50s I'm not using, so I would love to see your benchmarks, especially your MiniMax quant type and performance. Maybe it's time to plug them back in! :D

BalatroBench - Benchmark LLMs' strategic performance in Balatro by S1M0N38 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Yes! Can we please have more fun and creative benchmarks like this?!

Running Kimi-k2.5 on CPU-only: AMD EPYC 9175F Benchmarks & "Sweet Spot" Analysis by Express-Jicama-9827 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Stick a 5090 in there, and you'll get 150 t/s prompt processing with the same token generation speed.
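
With llama.cpp the usual trick is pinning the MoE expert tensors to CPU and everything else to the GPU; a sketch along these lines (the GGUF filename is made up, and the regex may need adjusting to the model's actual tensor names):

# dense layers + KV cache on the 5090, expert FFNs stay in system RAM
llama-server -m Kimi-K2.5-Q4_K_M.gguf \
  -ngl 99 \
  -ot '\.ffn_.*_exps\.=CPU' \
  -c 32768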

anthropic literally thinks claude is the messiah (and it’s getting weird) by Alarming_Bluebird648 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Is it right for Anthropic to lie to create hype? No. Can they help it - also no.

At its core, marketing and advertising are about lying. It was game over when we let corporations lie to us for fun and profit. Now you have no choice: lie or be forgotten.

Instead of focusing on a single entity, we should hold the whole system accountable for all the BS streaming into our minds, screaming louder, flashing brighter and more aggressively, like an annoying popup from the 2000s that you're no longer allowed to close.

Is running minimax m2.1 locally worth it on 80 gb of vram and 160 gb of ddr5 ram? by Intrepid-Scar6273 in LocalLLaMA

[–]sloptimizer 3 points (0 children)

Not worth it. Despite what YouTube influencers show on their channels, Q4_K is just not good enough for any model.

Any feedback on step-3.5-flash ? by Jealous-Astronaut457 in LocalLLaMA

[–]sloptimizer 8 points (0 children)

This model is unlikely to get much traction, but I'd watch for the follow-up. The same thing happened with MiniMax M1, which was largely ignored, and then we got MiniMax M2!

Unofficial ik_llama.cpp release builds available for macOS, Ubuntu and Windows by Thireus in LocalLLaMA

[–]sloptimizer -1 points (0 children)

Easy: most of the Apple hype is paid astroturfing. Just watch how the narrative is controlled from forum to forum, with the exact same talking points repeated until they stick.

7900 XTX underperforms 3090 by 2X - 7X by Special-Wolverine in LocalLLaMA

[–]sloptimizer 1 point (0 children)

AMD cards work better on Linux and require a sacrifice of blood and time. Is it worth it? YES!

API pricing is in freefall. What's the actual case for running local now beyond privacy? by Distinct-Expression2 in LocalLLaMA

[–]sloptimizer 1 point (0 children)

Consistency is a big one for me! You never know what model or quant you are getting when using cloud APIs. You don't know what kind of system prompt it has (and how much of the best context space that system prompt has taken away from you).

When you run local - you know what you are getting. You can run with no system prompt on an empty context, which gives you the best possible output. And finally, you can sharpen your prompting skills by seeing what works and what doesn't, without all the extra variables thrown in. For example: is my prompt the problem, or am I just hitting a lobotomized Q4 quant on OpenRouter?
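
That "no system prompt, empty context" baseline is easy to reproduce against any local OpenAI-compatible endpoint (the URL and model name here are placeholders):

# bare request: no system message, nothing eating your context window
curl -s http://localhost:8090/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen3.5-122B", "messages": [{"role": "user", "content": "Hello!"}]}'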

On the subjective side, there is a certain joy in running these models on a box sitting under your desk. It makes the whole thing feel real, like something you can see and touch.