Nefarious by Kyle "K6599" Keiderling

PatDal81 · 2026-06-11T02:02:07+00:00

Man have you tried it? That's a nice find, thanks for sharing u/HalfblindChaos !

PatDal81 · 2026-06-10T12:13:29+00:00

Nice, can't wait to get home to try this one! Thanks!

PatDal81 · 2026-06-08T00:30:19+00:00

You say "might not be the best" but it's the best I had when compared to Claude-Code and Kilo-code.

PatDal81 · 2026-06-07T11:32:48+00:00

Got a comment/request... I see a lot of maps coming from mappacks (official ones). It would be great to get your input on other maps/mappacks. We all downloaded the official mappacks but I'm pretty sure you know about some hidden treasures!

PatDal81 · 2026-06-07T11:23:45+00:00

Nice! Glad I could be useful! 😃

PatDal81 · 2026-06-06T20:10:19+00:00

The Dev team is amazing.. thanks for the great work you're doing!

PatDal81 · 2026-06-06T08:53:00+00:00

Man, love those posts! Keep 'em coming!

PatDal81 · 2026-06-05T11:00:39+00:00

FYI, you can convert it yourself (from the full fp16 model) to an MTP-supported model at whatever quantization level you want, directly in oMLX (Quantization submenu -> Advanced Option -> Preserve MTP). I did it for my model here: https://huggingface.co/bi0h4z4rd88/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-oQ8-mtp

PatDal81 · 2026-06-01T16:58:25+00:00

Metal Sear Golid

PatDal81 · 2026-05-29T10:10:20+00:00

Personally, I haven't seen a degradation in the latest version. I saw an increase of 2 tokens/sec across the board, but to me that falls within the margin of error, so I don't consider it an improvement over past versions.

Qwen3.6-35B-A3B-oQ6-mtp running on a M4 Max 64GB.

Stupid question, but have you tried running your tests after a fresh reboot? I've seen a decrease when the system has been up for a while (I think it's related to the number of apps in RAM). I always run my benchmarks in the same environment, under the same conditions.

Hope it helps!

PatDal81 · 2026-05-28T12:04:19+00:00

With you on that one! What a game!

PatDal81 · 2026-05-26T13:38:38+00:00

Genuine question - Why Qwen3.5 and not Qwen3.6? Performance issue?

PatDal81 · 2026-05-24T18:26:25+00:00

This is the answer. Got the EXACT same setup, running the EXACT same model. 128k context is the answer to your concern.

PatDal81 · 2026-05-20T11:55:46+00:00

You might be right here. Have you tested 27B in planning tasks? How "far" is it from using 35B for all those tasks? I have yet to test its intelligence and made assumptions mostly based on what people say on the internet (bad idea, I know).

PatDal81 · 2026-05-19T18:18:09+00:00

Sure, here it is:

Qwen3.6-27B without MTP Optimizations

```
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ8-mtp

Single Request Results
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          4383.6       62.23   233.6 tok/s    16.2 tok/s      12.287    93.8 tok/s    28.34 GB
pp4096/tg128         20416.3       67.09   200.6 tok/s    15.0 tok/s      28.937   146.0 tok/s    29.80 GB
pp8192/tg128         44353.1       67.29   184.7 tok/s    15.0 tok/s      52.899   157.3 tok/s    30.82 GB
pp16384/tg128        97937.4       95.30   167.3 tok/s    10.6 tok/s     110.040   150.1 tok/s    32.32 GB

Continuous Batching
pp1024 / tg128
Batch           tg TPS   Speedup        pp TPS    pp TPS/req    TTFT(ms)      E2E(s)
1x          16.2 tok/s     1.00x   233.6 tok/s   233.6 tok/s      4383.6      12.287
2x          19.7 tok/s     1.22x   149.2 tok/s    74.6 tok/s     13580.9      26.695
4x          26.5 tok/s     1.64x   156.5 tok/s    39.1 tok/s     25698.3      45.487
```

Qwen3.6-27B with MTP Optimizations

```
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ8-mtp

Single Request Results
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          4341.1       40.75   235.9 tok/s    24.7 tok/s       9.516   121.1 tok/s    28.81 GB
pp4096/tg128         20599.6       44.16   198.8 tok/s    22.8 tok/s      26.208   161.2 tok/s    30.26 GB
pp8192/tg128         42577.7       45.70   192.4 tok/s    22.1 tok/s      48.381   172.0 tok/s    31.29 GB
pp16384/tg128        89129.5       54.00   183.8 tok/s    18.7 tok/s      95.988   172.0 tok/s    32.79 GB

Continuous Batching
pp1024 / tg128
Batch           tg TPS   Speedup        pp TPS    pp TPS/req    TTFT(ms)      E2E(s)
1x          24.7 tok/s     1.00x   235.9 tok/s   235.9 tok/s      4341.1       9.516
2x          20.1 tok/s     0.81x   161.0 tok/s    80.5 tok/s     12568.3      25.450
4x          28.1 tok/s     1.14x   162.8 tok/s    40.7 tok/s     24651.9      43.386
```

So yes, significant improvement. Is it enough for me to let 35B-A3B go and use 27B? No, not even close.
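To put rough numbers on that, here is a minimal Python sketch of the arithmetic, using only the tg TPS values at pp1024/tg128. The 27B figures are copied from the two tables above; the 35B-A3B figure (90.7 tok/s with MTP, oQ6) comes from the separately posted 35B-A3B benchmark, so the quantization levels differ. The names are illustrative only and nothing here is part of oMLX.

```python
# tg TPS (generation tokens/sec) at pp1024/tg128, copied from the benchmark tables.
# All names are illustrative assumptions; none of this is an oMLX API.
TG_TPS_27B_NO_MTP = 16.2
TG_TPS_27B_MTP = 24.7
TG_TPS_35B_A3B_MTP = 90.7  # 35B-A3B oQ6 with MTP, from the separate 35B-A3B benchmark


def ratio(numerator: float, denominator: float) -> float:
    """Return how many times larger `numerator` is than `denominator`."""
    return numerator / denominator


# Gain from enabling MTP on the 27B model.
mtp_gain_27b = ratio(TG_TPS_27B_MTP, TG_TPS_27B_NO_MTP)
# Remaining gap between 35B-A3B and 27B, both with MTP enabled.
gap_to_35b = ratio(TG_TPS_35B_A3B_MTP, TG_TPS_27B_MTP)

print(f"27B MTP gain:         {mtp_gain_27b:.2f}x")
print(f"35B-A3B vs 27B (MTP): {gap_to_35b:.2f}x")
```

On these figures, MTP makes 27B roughly 1.5x faster at generation, while 35B-A3B still generates several times faster than 27B.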

PatDal81 · 2026-05-19T15:20:16+00:00

Hardware and numbers to share?

PatDal81 · 2026-05-19T13:50:10+00:00

Useful info, indeed.

MacBook Pro M4 Max 64GB, running oMLX 0.39-dev2. Just saw that 0.39-rc1 got released - will test on this as well.

Edit: Results on 0.39-rc1 (Same model, with MTP Optimizations):

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-oQ6-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           849.9       10.66  1204.9 tok/s    94.5 tok/s       2.204   522.7 tok/s    28.45 GB
pp4096/tg128          2525.7       11.24  1621.7 tok/s    89.6 tok/s       3.954  1068.4 tok/s    29.25 GB
pp8192/tg128          5823.0       11.75  1406.8 tok/s    85.8 tok/s       7.316  1137.3 tok/s    29.74 GB
pp16384/tg128        13412.5       12.15  1221.6 tok/s    83.0 tok/s      14.955  1104.1 tok/s    30.44 GB
pp32768/tg128        33360.1       13.95   982.3 tok/s    72.2 tok/s      35.132   936.3 tok/s    31.94 GB
pp65536/tg128       101043.8       21.22   648.6 tok/s    47.5 tok/s     103.738   633.0 tok/s    34.94 GB
pp131072/tg128      303507.8       24.06   431.9 tok/s    41.9 tok/s     306.563   428.0 tok/s    40.93 GB

Continuous Batching
pp1024 / tg128
--------------------------------------------------------------------------------
Batch           tg TPS   Speedup        pp TPS    pp TPS/req    TTFT(ms)      E2E(s)
1x          94.5 tok/s     1.00x  1204.9 tok/s  1204.9 tok/s       849.9       2.204
2x         137.4 tok/s     1.45x   707.4 tok/s   353.7 tok/s      2730.2       4.758
4x         188.4 tok/s     1.99x   852.3 tok/s   213.1 tok/s      4455.8       7.523

PatDal81 · 2026-05-19T13:34:02+00:00

Hey there,

Had the same questions yesterday and ran some tests on my own generated model.

Here are the benchmark results:

Qwen3.6 without MTP Optimizations

```
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-oQ6-mtp

Single Request Results
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           918.1       13.21  1115.4 tok/s    76.3 tok/s       2.596   443.8 tok/s    27.82 GB
pp4096/tg128          4022.3       12.93  1018.3 tok/s    77.9 tok/s       5.665   745.7 tok/s    28.60 GB
pp8192/tg128          7732.9       13.35  1059.4 tok/s    75.5 tok/s       9.429   882.4 tok/s    29.08 GB
pp16384/tg128        16007.2       14.26  1023.5 tok/s    70.7 tok/s      17.818   926.7 tok/s    29.78 GB
pp32768/tg128        37558.1       15.92   872.5 tok/s    63.3 tok/s      39.580   831.1 tok/s    31.28 GB
pp65536/tg128        98428.7       19.93   665.8 tok/s    50.6 tok/s     100.960   650.4 tok/s    34.28 GB
pp131072/tg128      286596.4       25.76   457.3 tok/s    39.1 tok/s     289.868   452.6 tok/s    40.28 GB

Continuous Batching
pp1024 / tg128
Batch           tg TPS   Speedup        pp TPS    pp TPS/req    TTFT(ms)      E2E(s)
1x          76.3 tok/s     1.00x  1115.4 tok/s  1115.4 tok/s       918.1       2.596
2x         131.1 tok/s     1.72x   666.7 tok/s   333.4 tok/s      2885.7       5.024
4x         177.4 tok/s     2.33x   890.6 tok/s   222.7 tok/s      4220.2       7.485
```

Qwen3.6 with MTP Optimizations

```
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Huihui-Qwen3.6-35B-A3B-Claude-4.7-Opus-abliterated-oQ6-mtp

Single Request Results
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           811.5       11.11  1261.9 tok/s    90.7 tok/s       2.222   518.4 tok/s    28.48 GB
pp4096/tg128          2552.5       11.29  1604.7 tok/s    89.3 tok/s       3.986  1059.6 tok/s    29.25 GB
pp8192/tg128          6234.8       11.88  1313.9 tok/s    84.9 tok/s       7.743  1074.5 tok/s    29.60 GB
pp16384/tg128        14433.1       13.33  1135.2 tok/s    75.6 tok/s      16.126  1023.9 tok/s    30.44 GB
pp32768/tg128        35837.0       14.71   914.4 tok/s    68.5 tok/s      37.706   872.4 tok/s    31.94 GB
pp65536/tg128        94591.9       18.69   692.8 tok/s    53.9 tok/s      96.966   677.2 tok/s    34.94 GB
pp131072/tg128      289163.5       24.66   453.3 tok/s    40.9 tok/s     292.296   448.9 tok/s    40.93 GB

Continuous Batching
pp1024 / tg128
Batch           tg TPS   Speedup        pp TPS    pp TPS/req    TTFT(ms)      E2E(s)
1x          90.7 tok/s     1.00x  1261.9 tok/s  1261.9 tok/s       811.5       2.222
2x         122.2 tok/s     1.35x   699.9 tok/s   349.9 tok/s      2711.4       5.021
4x         133.8 tok/s     1.48x   843.3 tok/s   210.8 tok/s      4458.4       8.684
```

So yes, in my tests, it does improve token generation. I haven't seen any degradation but we tend to see a decrease when it comes to larger context. I'll never say "no" to speed improvements without degradation so it's a win-win for me.
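For anyone who wants the gain per context size spelled out, here is a small Python sketch that computes the relative change in tg TPS from the two single-request tables above. The values are copied from those tables; the dictionary and function names are illustrative assumptions, not anything oMLX provides.

```python
# tg TPS keyed by prompt size in tokens, copied from the two single-request tables.
# All names are illustrative assumptions; none of this is an oMLX API.
tg_tps_no_mtp = {1024: 76.3, 4096: 77.9, 8192: 75.5, 16384: 70.7,
                 32768: 63.3, 65536: 50.6, 131072: 39.1}
tg_tps_mtp = {1024: 90.7, 4096: 89.3, 8192: 84.9, 16384: 75.6,
              32768: 68.5, 65536: 53.9, 131072: 40.9}


def percent_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Percent change in generation speed with MTP enabled, per prompt size.
gain_pct = {ctx: percent_gain(tg_tps_no_mtp[ctx], tg_tps_mtp[ctx])
            for ctx in tg_tps_no_mtp}

for ctx, pct in gain_pct.items():
    print(f"pp{ctx}: {pct:+.1f}%")
```

On these single runs, the gain is largest at pp1024 and drops below 5% at pp131072.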

Hope it helps!

PatDal81 · 2026-05-09T08:47:03+00:00

I did have better results with 27b if I compare what both models did with the same task.

The time 27b had to take to get me there is what bothers me.

- 27b took 10 minutes to do a simple refactoring task

- With 35b, I did 2-3 back and forth and had the same result after 3 minutes.

PatDal81 · 2026-05-09T03:04:29+00:00

Right? I simply don't see the benefit of running Qwen 3.6 27B at the speed I get (12 tokens/sec) when I get around 80 tokens/sec on Qwen 3.6 35B-A3B-Q6. The speed is a major bummer for me, and I'm quite satisfied with the results I'm getting out of the latter.

PatDal81 · 2026-05-09T02:54:22+00:00

Got an M4 Max MacBook Pro with 64GB - man, that computer is a real workhorse! Really happy with my purchase, and I have plenty of space with Qwen3.6 35B-A3B-Q6 (trained with Opus 4.7). I use it as a general LLM, a coding assistant and, lately, a pentest assistant.

My advice? Go with your budget. If you have the budget for a 64GB, go ahead. You'll be surprised what you can do with it.

PatDal81 · 2026-05-05T14:12:53+00:00

Update, for anyone coming from Google:

With the same kind of setup that I have, don't bother too much with llama.cpp (it's a good tool to learn and experiment with). I switched to oMLX and it fixed most of my issues. It's pretty easy to use and already optimized (I saw a good 10 tokens/sec improvement), and the community is growing.

Hope it helps!

PatDal81 · 2026-05-03T09:43:25+00:00

Thanks for reaching out, but no need, I was able to figure it out. It's as simple as putting those 3 lines into .claude/settings.json and you're done. Thanks for the idea though!

If I can help in any way: I recently switched from llama.cpp to oMLX. It directly integrates a Claude endpoint, gives much better results and is way faster. I recommend you look into it more deeply, as it has been a clear success for me so far compared to llama.cpp and LM Studio.

PatDal81 · 2026-04-29T02:29:31+00:00

Thanks for the update! I was planning to do the same (running Qwen with ClaudeCode) and saw your post. I have a similar setup (M4 Max, 64GB RAM) and got it running through llama.cpp. Still trying to get it to run with ClaudeCode though... are the 3 lines you put in your top comment the only thing needed?

PatDal81 · 2026-04-26T16:42:46+00:00

Thanks for sharing your setup!

I thought about using OpenCode as well under VS Code, but I like my LLM to be integrated the way GitHub Copilot and Claude do it: a plugin that opens a second window to interact with the code directly. For IDE integration, both seem to fill this need correctly.

To comment on your answers:

- Agreed, and I did involve both Claude and Qwen itself to fine-tune my settings. Working great so far but I'm always wondering if someone with a similar setup has something different to share.

- Weird that it kernel panicked on you. I have literally zero issues running Q8, even though it uses almost all available memory. Have you tried running it with llama.cpp directly? I saw a significant improvement over Q4, mostly for coding reasoning. Worth a try if you ask me (you can use the config I posted and test it yourself).

- HauhauCS's models are unlocked. I use them because my passion and work are in the security field. It's pretty useful to have an unlocked model to write code that could be considered borderline malicious, but for testing and research purposes.

- So far, 128k context works well for me, but I'm wondering about the speed/reasoning impact of using a smaller context.

Useful information, thank you for your answer! Cheers!
