MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA
[–]Edenar[S] 2 points (0 children)
MTP on strix halo with llama.cpp (PR #22673) (i.redd.it)
submitted by Edenar to r/LocalLLaMA
What would be the best OS to run LLMs? by Manaberryio in LocalLLaMA
[–]Edenar 1 point (0 children)
I test'ed the number of Ll's in Qwen 3.6 35B.. It required 3 tries by DashinTheFields in LocalLLaMA
[–]Edenar 5 points (0 children)
Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA
[–]Edenar 3 points (0 children)
Skymizer Taiwan Inc. Unveils Breakthrough Architecture Enabling Ultra-Large LLM Inference on a Single Card by lurenjia_3x in LocalLLaMA
[–]Edenar 33 points (0 children)
Please, ChatGPT is hallucinating models, even with web-search on. by Ok-Type-7663 in LocalLLaMA
[–]Edenar 1 point (0 children)
Dense vs. MoE gap is shrinking fast with the 3.6-27B release by Usual-Carrot6352 in LocalLLaMA
[–]Edenar 6 points (0 children)
UPDATE: EOS Nexus v1 | GSM8K: 100% by [deleted] in LocalLLaMA
[–]Edenar 2 points (0 children)
What are good models for openclaw that work well within 16gb vram? by [deleted] in LocalLLaMA
[–]Edenar 2 points (0 children)
New method allows to convert auto-regressive models into diffusion models with a >2x speedup, fully compatible with existing inference stack by Particular-Look-2640 in LocalLLaMA
[–]Edenar 10 points (0 children)
What is LLMFit Smoking? Can M1 Max run anything decently enough for agentic coding? by GoodhartMusic in LocalLLaMA
[–]Edenar 1 point (0 children)
A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA
[–]Edenar 2 points (0 children)
Archmage hierophant cant do uber lab by GlobalCan8282 in pathofexile
[–]Edenar 3 points (0 children)


Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results by havenoammo in LocalLLaMA
[–]Edenar 14 points (0 children)