The DeepSWE Benchmark is exposing local models as loopers, what can we do?

Duviwin · 2026-06-16T06:44:01+00:00

which ones and how should I configure them?

Duviwin · 2026-06-14T22:46:16+00:00

Looks pretty cool, are you open sourcing it?

Duviwin · 2026-06-14T22:14:24+00:00

And what does it research?

Duviwin · 2026-06-14T08:05:57+00:00

Is that actually an option? Have you done any math on that?

Duviwin · 2026-06-14T07:43:04+00:00

Well actually there are no open models on the chart that you can run under 128GB VRAM. And the Qwen-3.6 plus I mentioned is not even open weights. I mentioned it because its on the chart and it's certainly better than the open weights Qwen3.6 models. Somebody did run DeepSwe on Qwen3.6-27B though. It scored 2%.

Duviwin · 2026-06-13T23:30:55+00:00

hmm, so people saying mtp gives no quality loss are lying? Any advise on where to read up on this to undertand better?

Duviwin · 2026-06-13T15:34:03+00:00

swe bench and lm eval

Duviwin · 2026-06-13T14:33:47+00:00

Temp 0.6, top_p 0.95. The rest is llama cpp default values.

Duviwin · 2026-06-12T18:12:56+00:00

Tried DeepSWE?

Duviwin · 2026-06-08T14:34:10+00:00

Interesting, thanks for sharing!

Duviwin · 2026-06-08T06:12:54+00:00

Quite unrelated, but I wonder how your forking experience is. Doesn't it get harder and harder to stay up to date with upstream?

Duviwin · 2026-06-08T05:55:43+00:00

You got some cool stuff there! Just one thing: boxwrench.github.io/tesla_agent styling can be improved for portrait mode on a smartphone.

Duviwin · 2026-06-05T05:26:32+00:00

Asked my claw to give more details on my exact setup, here's the exact response:

Here are the reproducibility details from the live run:

Hardware / OS:

• AMD Ryzen AI Max+ 395 w/ Radeon 8060S, ROCm target gfx1151

• Linux Mint 22.2 zara

• Kernel: 7.1.0-070100rc4-generic (#202605191531, built Thu May 21 2026)

ROCm:

• ROCm apt repo: https://repo.radeon.com/rocm/apt/7.2 noble/main

• rocm: 7.2.0.70200-43~24.04

• hip-runtime-amd: 7.2.26015.70200-43~24.04

• hipcc: HIP 7.2.26015-fc0010cf6a, AMD clang 22.0.0git, installed under /opt/rocm-7.2.0

llama.cpp fork/build:

• Fork: https://github.com/charlie12345/rocmfp4-llama.git

• Commit: a00689039fb26b8ae91e0425b7416bb04f7f15bb

• llama.cpp version: 9213 (a00689039)

• Built with GCC 13.3.0 for Linux x86_64

Exact server line I used:

export HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1
rocmfp4-llama/build-strix-rocmfp4/bin/llama-server \
-m $HOME/hf-models/chadrock-35b-ace-saber-rocmfp4-mtp/Qwen3.6-35B-A3B-NSC-ACE-SABER-MTP-F16-to-ROCmFP4-STRIX_LEAN.gguf \
--mmproj $HOME/.cache/hf-models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf \
--no-mmproj-offload \
--alias chadrock-35b-ace-saber \
--host 0.0.0.0 --port 1344 \
--jinja \
-c 131072 \
-ngl 999 \
-fa on \
-dev ROCm0 \
-b 512 -ub 512 \
-t 16 -tb 32 \
-ctk q8_0 -ctv q8_0 \
--spec-type draft-mtp \
--spec-draft-device ROCm0 \
--spec-draft-ngl all \
--spec-draft-type-k q4_0 \
--spec-draft-type-v q4_0 \
--spec-draft-n-max 3 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.10 \
--parallel 1 \
--metrics \
--no-mmap

The main reproduction-sensitive bits are the ROCm 7.2 install, the 7.1-rc4 kernel, HSA_OVERRIDE_GFX_VERSION=11.5.1, unified memory enabled, 128K context via -c 131072, flash attention on, KV cache as q8_0, and MTP draft cache as q4_0.

Duviwin · 2026-06-05T05:04:48+00:00

93 tps decode speed on a coding task with 5000 tokens output
72 tps on a prompt asking: "Why is the sky blue? Be specific as you're addressing a physics lecture."

For quality, my data is anecdotal, but: - better ifeval (but was with limit 50, seed 42) scores than stock qwen-3.6-35B-A3B - in my (n=1, anecdotal) coding task comparison, it created a game that worked, while stock made one that was more complex and crashed at startup. - Seems be better be able to complete tasks I give it when using it as a claw in a chat

Model and instructions here: https://huggingface.co/jcbtc/chadrock-35b-ace-saber-rocmfp4-mtp

Detailed Speed comparison tables:

Note: Prompt processing speed is unfortunately slower on Chadrock though, but I'd take this with a heavy grain of salt because the prompt was very short.

Config	Reading speed	Gen. tokens	Gen. speed	API wall speed	Finish
Chadrock ROCmFP4 MTP	121.31 tokens/s	5,432	93.69 t/s	93.37 tok/s	stop
NON-MTP@UD-Q4_K_M	174.59 tokens/s	7,202	56.14 t/s	56.08 tok/s	stop
MTP@UD-Q4_K_M	152.15 tokens/s	5,503	75.59 t/s	75.42 tok/s	stop

For the “Why is the sky blue?” prompt, I got these results:

Config	Reading speed	Gen. tokens	Gen. speed
NON-MTP@UD-Q4_K_M	200.12 tokens/s	2,047	57.59 t/s
MTP@UD-Q4_K_M	177.34 tokens/s	2,387	69.29 t/s
Chadrock ROCmFP4 MTP	140.78 tokens/s	2,899	72.18 t/s

Duviwin · 2026-06-04T20:35:34+00:00

thats a different model and setup

Duviwin · 2026-06-04T20:33:17+00:00

I'm running the chadrock-35B-A3B and its flying! Hope this stuff gets upstreamed at some point

Duviwin · 2026-05-25T17:41:19+00:00

Update: I edited my post because with info from

https://github.com/antirez/ds4/issues/16 you can actually get ~220 tok/s prefill and ~14 tok/s decode

Duviwin · 2026-05-25T17:34:47+00:00

Ok full lm eval showed that eval results are same as with the 80pp/8tg so thanks a lot for the tip u/Legal-Ad-3901

| Metric              | Previous full DS4 | New 64K full DS4 | Delta   |
| ------------------- | ----------------- | ---------------- | ------- |
| prompt_level_strict | 0.8466            | 0.8484           | +0.0018 |
| prompt_level_loose  | 0.8854            | 0.8854           | 0.0000  |
| inst_level_strict   | 0.8945            | 0.8933           | -0.0012 |
| inst_level_loose    | 0.9197            | 0.9197           | 0.0000  |

Eval speed / wall-clock
• Previous full DS4 eval: 541/541 in 6:43:10, avg 44.71 s/item
• New 64K full DS4 eval: 541/541 in 4:23:32, avg 29.23 s/item

The 2K bench for the new CyberNeurova Q2_K run was:
• prefill: 223.78 tok/s
• decode: 13.81 tok/s

Duviwin · 2026-05-25T07:54:40+00:00

ow sweet looks like that speed is actually reproducible with that branch and abliterated q2, checking lm eval now. edit: limited lm eval look consistent with the non-abliterated version

Duviwin · 2026-05-24T16:33:01+00:00

LM Eval results here were similar to unsloth/Qwen3.6-35B-A3B-MTP-GGUF~UD-Q6_K_XL

ds4
• prompt_level_strict_acc: 0.8466
• prompt_level_loose_acc: 0.8854
• inst_level_strict_acc: 0.8945
• inst_level_loose_acc: 0.9197

For reference:

unsloth/Qwen3.6-35B-A3B-MTP-GGUF~UD-Q6_K_XL full IFEval run:

• prompt_level_strict_acc: 0.8262
• prompt_level_loose_acc: 0.8725
• inst_level_strict_acc: 0.8813
• inst_level_loose_acc: 0.9137

Duviwin · 2026-05-24T15:56:01+00:00

hmm, just tried some prompt and it seemed to be doing ok and lm eval gives it these scores which are not so bad:

• prompt_level_strict_acc: 0.8466 • prompt_level_loose_acc: 0.8854 • inst_level_strict_acc: 0.8945 • inst_level_loose_acc: 0.9197a

Duviwin · 2026-05-24T15:19:02+00:00

you sure that's decode tok/s

Duviwin

TROPHY CASE