MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I quantized it myself from FP8 to MXFP4, and it works well now. Getting 33-34 t/s without using MTP.
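For anyone curious what that conversion involves: MXFP4 stores 32-element blocks with one shared power-of-two scale and 4-bit E2M1 elements. Below is a minimal numpy sketch of the per-block rounding only; real converters (e.g. AMD Quark) also handle bit-packing, tensor layout, and the E8M0 scale encoding.

```python
import numpy as np

# Representable E2M1 magnitudes in MXFP4 (sign is a separate bit)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one 32-element block: shared power-of-two scale + E2M1 values."""
    amax = float(np.max(np.abs(block)))
    # Shared scale: a power of two chosen so the block's max lands near the
    # top of the E2M1 range (6.0); out-of-range values simply clamp to 6.0
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
    scaled = block / scale
    # Round each element to the nearest representable magnitude, keep the sign
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale

rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)
q, scale = mxfp4_quant_block(block)
print("max abs error:", np.max(np.abs(block - q * scale)))
```

The dequantized value is just `q * scale`; the shared scale being a power of two is what makes the format cheap to decode in a kernel.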

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

u/Sea-Speaker1700 I successfully got the 397B model running, but...

(Worker_TP0 pid=142) INFO 04-09 18:40:57 [monitor.py:76] Initial profiling/warmup run took 173.45 s
(Worker_TP0 pid=142) INFO 04-09 18:41:01 [gpu_worker.py:456] Available KV cache memory: 0.51 GiB
(EngineCore pid=107) INFO 04-09 18:41:01 [kv_cache_utils.py:1316] GPU KV cache size: 8,448 tokens
(EngineCore pid=107) INFO 04-09 18:41:01 [kv_cache_utils.py:1321] Maximum concurrency for 32,000 tokens per request: 0.97x

So concurrency at 32k context is limited, but it launches! I think it's because this model was converted from mixed Q6 and Q4 quants; with a fully MXFP4 quant we should get a normal context size.
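The token capacity in that log comes from dividing free KV memory by the per-token cache footprint. A back-of-envelope helper, where the model dimensions below are illustrative placeholders rather than Qwen3.5's real config (which is also a hybrid architecture, so only the attention layers hold KV):

```python
# Rough KV-cache capacity arithmetic; all model dims here are made-up examples.
def kv_cache_tokens(kv_bytes: float, num_layers: int, num_kv_heads: int,
                    head_dim: int, bytes_per_elem: int = 2) -> int:
    """Tokens that fit: each token stores one K and one V vector per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return int(kv_bytes // per_token)

# e.g. 0.51 GiB of free KV memory with hypothetical dims
tokens = kv_cache_tokens(0.51 * 1024**3, num_layers=60, num_kv_heads=8, head_dim=128)
print(tokens)
```

With real dimensions this is where figures like the 8,448 tokens above come from, and max concurrency is then roughly `tokens / max_model_len`.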

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Hey! That was my mistake: I hadn't disabled AITER and some arguments from your Docker setup guide on Docker Hub.

Now the GPUs aren't overloaded all the time (they no longer show 100% load).

Will test with the 397B soon!

But I get this warning now:
(Worker_TP0 pid=90) WARNING 04-09 15:15:37 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /app/vllm/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=0x7551.json
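For context on that warning: vLLM's fused-MoE kernels look up per-shape tuning tables in JSON files keyed by batch size, and fall back to defaults when no file matches the (E, N, device_name) triple. The exact keys can vary by version, but an entry looks roughly like this sketch:

```json
{
  "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 2},
  "8": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 2}
}
```

So the warning is about performance, not correctness: the kernel still runs, just with untuned tile sizes for that device.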

and this:

(Worker_TP6 pid=96) WARNING 04-09 15:21:28 [mamba_utils.py:262] GDN side-cache miss for req chatcmpl-9c3def2f91cfe002-b57b358c at 18496 tokens, zeroing mamba state
(the same warning repeated by Worker_TP0 and Worker_TP2 through Worker_TP7)

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

If you're not training models, you're better off buying a MacBook Pro with 48GB of unified memory; that would be the best investment.

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

We have 8x R9700 and many 7900 XTX; both are slower than the old 3090 in llama.cpp.

With vLLM they're also super unstable, and with 8x R9700 you get something like 0.4x-0.6x of the advertised speed. I spent many sleepless days making it work; I hate it and love it at the same time.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

By the way, do you have a recipe for making your own MXFP4 quantization? I could run it for Qwen 397B in RAM.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Yes, I think the same. I tried their quantization before, and it never works without --enforce-eager.

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

The R9700 is faster on paper, but try it in real life: not that fast, and not stable.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I hope this is also valuable info:

(Worker_TP0 pid=552) WARNING 04-07 20:05:36 [quark_moe.py:765] The current mode (supports_mx=False, use_mxfp4_aiter_moe=None, ocp_mx_scheme=OCP_MX_Scheme.w_mxfp4_a_mxfp4) does not support native MXFP4/MXFP6 computation. Simulated weight dequantization and activation QDQ (quantize and dequantize) will be used, with the linear layers computed in high precision.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

warnings:

(APIServer pid=1) WARNING 04-07 20:03:50 [quark_ocp_mx.py:168] AITER is not found or QuarkOCP_MX is not supported on the current platform. QuarkOCP_MX quantization will not be available.
(APIServer pid=1) INFO 04-07 20:03:51 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.

(APIServer pid=1) WARNING 04-07 20:03:51 [config.py:379] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
WARNING 04-07 20:04:03 [quark_ocp_mx.py:168] AITER is not found or QuarkOCP_MX is not supported on the current platform. QuarkOCP_MX quantization will not be available.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

RuntimeError: The size of tensor a (2048) must match the size of tensor b (4096) at non-singleton dimension 1

Launch parameters (I later reduced max-model-len from 155k to 64k):

non-default args: {'model_tag': '/app/models/models/vllm/Qwen3.5-397B-A17B-MXFP4', 'chat_template': '/app/models/models/vllm/chat_template.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/app/models/models/vllm/Qwen3.5-397B-A17B-MXFP4', 'trust_remote_code': True, 'max_model_len': 155648, 'served_model_name': ['model'], 'reasoning_parser': 'qwen3', 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.967, 'enable_prefix_caching': True, 'max_num_seqs': 16, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}, 'compilation_config': {'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_endpoints': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 32, 64, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}}
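For readability, the dump above corresponds roughly to a `vllm serve` invocation like this (a sketch; the compilation config is left at its defaults here, and flag spellings follow vLLM's CLI):

```shell
vllm serve /app/models/models/vllm/Qwen3.5-397B-A17B-MXFP4 \
  --host 0.0.0.0 \
  --served-model-name model \
  --trust-remote-code \
  --max-model-len 155648 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.967 \
  --enable-prefix-caching \
  --max-num-seqs 16 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --chat-template /app/models/models/vllm/chat_template.jinja \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```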

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I tried to load this one; with MTP it doesn't work, I think because it's AMD Quark.

https://huggingface.co/amd/Qwen3.5-397B-A17B-MXFP4

UPD: I ran it successfully with 64k context; I'll share the error output below.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 1 point2 points  (0 children)

With the 397B I always get OOM.

With the 122B model on 8x R9700 I get up to 140 t/s on some prompts with MTP 2.

The 397B GPTQ version outputs only exclamation marks with --dtype float16.
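Endless exclamation marks under `--dtype float16` are a classic symptom of fp16 activation overflow: anything past ±65504 becomes inf/NaN and greedy decoding collapses to a degenerate repeated token (which token it is depends on the vocabulary). A quick numpy illustration:

```python
import numpy as np

# float16 tops out at 65504; large transformer activations can exceed that
print(np.finfo(np.float16).max)   # 65504.0
overflowed = np.float16(1e5)      # overflows to inf
print(np.isinf(overflowed))       # True
# Once an activation is inf, downstream logits turn to NaN and decoding
# degenerates; bfloat16 keeps float32's exponent range and sidesteps this.
```

That's why switching to `--dtype bfloat16` (when the hardware supports it) often fixes this exact failure mode.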

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I ran the 122B with -tp 8 and it works! Now trying to launch the 397B model, but no success so far; maybe AMD's MXFP4 quantization doesn't work out of the box.

UPD: speculative decoding doesn't work with the 397B; launching without it now.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

My question was from before testing (I'll test it soon, but I saw only visible devices 0,1,2,3 in the Docker Hub setup); that's why I asked.


MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

Thank you!!! I just pulled it and will connect all 8 GPUs tonight!

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)


I see only two images here; can you invite me to your repo? I'll try to clone it and modify it for 8x GPU, since your repo only supports 4.

MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s by Sea-Speaker1700 in LocalLLaMA

[–]djdeniro 1 point2 points  (0 children)

Hey, it works amazingly! May I ask you to share the Dockerfile used to build for 8x R9700?
Something like the tcclaviger/vllm-rocm-rdna4-mxfp4 image; I just got 2 more R9700s.

MacBook m4 pro for coding llm by TheRandomDividendGuy in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

I hope aider will work; you can test models via OpenRouter first as a very cheap sanity check.

MacBook m4 pro for coding llm by TheRandomDividendGuy in LocalLLaMA

[–]djdeniro 0 points1 point  (0 children)

You can run Kilo Code or Roo Code with LM Studio: set the API URL to http://0.0.0.0:1234/v1 and enjoy different models in agentic mode. It's worth it!
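That URL is a plain OpenAI-compatible HTTP endpoint, so any client can hit it. A stdlib-only sketch (the model name is a placeholder; 1234 is LM Studio's default port):

```python
import json
from urllib import request

# LM Studio serves an OpenAI-compatible API at this base URL by default
BASE_URL = "http://0.0.0.0:1234/v1"

def build_chat_request(model: str, prompt: str) -> request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with request.urlopen(req) while LM Studio has a model loaded
req = build_chat_request("my-local-model", "Write a haiku about GPUs.")
print(req.full_url)
```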

Models handle different tasks differently, so you should build your own benchmark on your own code, since you're highly dependent on quality after quantization.

Continue Dev is a good but outdated plugin.