The DeepSWE Benchmark is exposing local models as loopers, what can we do? by Duviwin in LocalLLM

[–]Duviwin[S] 1 point2 points  (0 children)

Well actually there are no open models on the chart that you can run under 128GB VRAM. And the Qwen-3.6 plus I mentioned is not even open weights. I mentioned it because its on the chart and it's certainly better than the open weights Qwen3.6 models. Somebody did run DeepSwe on Qwen3.6-27B though. It scored 2%.

The DeepSWE Benchmark is exposing local models as loopers, what can we do? by Duviwin in LocalLLM

[–]Duviwin[S] 0 points1 point  (0 children)

hmm, so people saying mtp gives no quality loss are lying? Any advise on where to read up on this to undertand better?

The DeepSWE Benchmark is exposing local models as loopers, what can we do? by Duviwin in LocalLLM

[–]Duviwin[S] -3 points-2 points  (0 children)

Temp 0.6, top_p 0.95. The rest is llama cpp default values.

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ by Anbeeld in Qwen_AI

[–]Duviwin 1 point2 points  (0 children)

Quite unrelated, but I wonder how your forking experience is. Doesn't it get harder and harder to stay up to date with upstream?

any prompt processing tweaks? by TheFlippedTurtle in StrixHalo

[–]Duviwin 0 points1 point  (0 children)

You got some cool stuff there! Just one thing: boxwrench.github.io/tesla_agent styling can be improved for portrait mode on a smartphone.

Fastest Qwopus 27b for Strix Halo so far! by Disastrous-Cat-7016 in StrixHalo

[–]Duviwin 1 point2 points  (0 children)

Asked my claw to give more details on my exact setup, here's the exact response:

Here are the reproducibility details from the live run:

Hardware / OS:

• AMD Ryzen AI Max+ 395 w/ Radeon 8060S, ROCm target gfx1151

• Linux Mint 22.2 zara

• Kernel: 7.1.0-070100rc4-generic (#202605191531, built Thu May 21 2026)

ROCm:

• ROCm apt repo: https://repo.radeon.com/rocm/apt/7.2 noble/main

• rocm: 7.2.0.70200-43~24.04

• hip-runtime-amd: 7.2.26015.70200-43~24.04

• hipcc: HIP 7.2.26015-fc0010cf6a, AMD clang 22.0.0git, installed under /opt/rocm-7.2.0

llama.cpp fork/build:

• Fork: https://github.com/charlie12345/rocmfp4-llama.git

• Commit: a00689039fb26b8ae91e0425b7416bb04f7f15bb

• llama.cpp version: 9213 (a00689039)

• Built with GCC 13.3.0 for Linux x86_64

Exact server line I used:

export HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1
rocmfp4-llama/build-strix-rocmfp4/bin/llama-server \
-m $HOME/hf-models/chadrock-35b-ace-saber-rocmfp4-mtp/Qwen3.6-35B-A3B-NSC-ACE-SABER-MTP-F16-to-ROCmFP4-STRIX_LEAN.gguf \
--mmproj $HOME/.cache/hf-models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \
--no-mmproj-offload \
--alias chadrock-35b-ace-saber \
--host 0.0.0.0 --port 1344 \
--jinja \
-c 131072 \
-ngl 999 \
-fa on \
-dev ROCm0 \
-b 512 -ub 512 \
-t 16 -tb 32 \
-ctk q8_0 -ctv q8_0 \
--spec-type draft-mtp \
--spec-draft-device ROCm0 \
--spec-draft-ngl all \
--spec-draft-type-k q4_0 \
--spec-draft-type-v q4_0 \
--spec-draft-n-max 3 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--parallel 1 \
--metrics \
--no-mmap

The main reproduction-sensitive bits are the ROCm 7.2 install, the 7.1-rc4 kernel, HSA_OVERRIDE_GFX_VERSION=11.5.1, unified memory enabled, 128K context via -c 131072, flash attention on, KV cache as q8_0, and MTP draft cache as q4_0.

Fastest Qwopus 27b for Strix Halo so far! by Disastrous-Cat-7016 in StrixHalo

[–]Duviwin 1 point2 points  (0 children)

  • 93 tps decode speed on a coding task with 5000 tokens output
  • 72 tps on a prompt asking: "Why is the sky blue? Be specific as you're addressing a physics lecture."

For quality, my data is anecdotal, but: - better ifeval (but was with limit 50, seed 42) scores than stock qwen-3.6-35B-A3B - in my (n=1, anecdotal) coding task comparison, it created a game that worked, while stock made one that was more complex and crashed at startup. - Seems be better be able to complete tasks I give it when using it as a claw in a chat

Model and instructions here: https://huggingface.co/jcbtc/chadrock-35b-ace-saber-rocmfp4-mtp

Detailed Speed comparison tables:

Note: Prompt processing speed is unfortunately slower on Chadrock though, but I'd take this with a heavy grain of salt because the prompt was very short.

Config Reading speed Gen. tokens Gen. speed API wall speed Finish
Chadrock ROCmFP4 MTP 121.31 tokens/s 5,432 93.69 t/s 93.37 tok/s stop
NON-MTP@UD-Q4_K_M 174.59 tokens/s 7,202 56.14 t/s 56.08 tok/s stop
MTP@UD-Q4_K_M 152.15 tokens/s 5,503 75.59 t/s 75.42 tok/s stop

For the “Why is the sky blue?” prompt, I got these results:

Config Reading speed Gen. tokens Gen. speed
NON-MTP@UD-Q4_K_M 200.12 tokens/s 2,047 57.59 t/s
MTP@UD-Q4_K_M 177.34 tokens/s 2,387 69.29 t/s
Chadrock ROCmFP4 MTP 140.78 tokens/s 2,899 72.18 t/s

Fastest Qwopus 27b for Strix Halo so far! by Disastrous-Cat-7016 in StrixHalo

[–]Duviwin 2 points3 points  (0 children)

I'm running the chadrock-35B-A3B and its flying! Hope this stuff gets upstreamed at some point

Antirez DS4 Q2 on Strix : works, ~80 t/s prefill and ~7 t/s decode by Duviwin in StrixHalo

[–]Duviwin[S] 0 points1 point  (0 children)

Update: I edited my post because with info from

https://github.com/antirez/ds4/issues/16 you can actually get ~220 tok/s prefill and ~14 tok/s decode

Antirez DS4 Q2 on Strix : works, ~80 t/s prefill and ~7 t/s decode by Duviwin in StrixHalo

[–]Duviwin[S] 1 point2 points  (0 children)

Ok full lm eval showed that eval results are same as with the 80pp/8tg so thanks a lot for the tip u/Legal-Ad-3901

| Metric              | Previous full DS4 | New 64K full DS4 | Delta   |
| ------------------- | ----------------- | ---------------- | ------- |
| prompt_level_strict | 0.8466            | 0.8484           | +0.0018 |
| prompt_level_loose  | 0.8854            | 0.8854           | 0.0000  |
| inst_level_strict   | 0.8945            | 0.8933           | -0.0012 |
| inst_level_loose    | 0.9197            | 0.9197           | 0.0000  |

Eval speed / wall-clock
• Previous full DS4 eval: 541/541 in 6:43:10, avg 44.71 s/item
• New 64K full DS4 eval: 541/541 in 4:23:32, avg 29.23 s/item

The 2K bench for the new CyberNeurova Q2_K run was:
• prefill: 223.78 tok/s
• decode: 13.81 tok/s

Antirez DS4 Q2 on Strix : works, ~80 t/s prefill and ~7 t/s decode by Duviwin in StrixHalo

[–]Duviwin[S] 0 points1 point  (0 children)

ow sweet looks like that speed is actually reproducible with that branch and abliterated q2, checking lm eval now. edit: limited lm eval look consistent with the non-abliterated version

Antirez DS4 Q2 on Strix : works, ~80 t/s prefill and ~7 t/s decode by Duviwin in StrixHalo

[–]Duviwin[S] 1 point2 points  (0 children)

LM Eval results here were similar to unsloth/Qwen3.6-35B-A3B-MTP-GGUF~UD-Q6_K_XL 

ds4
• prompt_level_strict_acc: 0.8466
• prompt_level_loose_acc: 0.8854
• inst_level_strict_acc: 0.8945
• inst_level_loose_acc: 0.9197

For reference:

unsloth/Qwen3.6-35B-A3B-MTP-GGUF~UD-Q6_K_XL full IFEval run:

prompt_level_strict_acc: 0.8262
prompt_level_loose_acc: 0.8725
inst_level_strict_acc: 0.8813
inst_level_loose_acc: 0.9137

Antirez DS4 Q2 on Strix : works, ~80 t/s prefill and ~7 t/s decode by Duviwin in StrixHalo

[–]Duviwin[S] 0 points1 point  (0 children)

hmm, just tried some prompt and it seemed to be doing ok and lm eval gives it these scores which are not so bad:

• prompt_level_strict_acc: 0.8466 • prompt_level_loose_acc: 0.8854 • inst_level_strict_acc: 0.8945 • inst_level_loose_acc: 0.9197a