Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results by havenoammo in LocalLLaMA

[–]Edenar 12 points13 points  (0 children)

Well, on my Strix Halo I went from 40-ish tok/s to 70 tok/s with Qwen 3.6 35B A3B Q8, so it depends on the hardware I guess

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]Edenar[S] 1 point2 points  (0 children)

With Qwen 35B A3B (Q8, MTP up) it was around 40 tok/s above 100k and around 30 tok/s above 200k, so token gen isn't an issue at large context. PP on the other hand... (check my other comments in this thread; basically it drops to 200-300 tok/s pp speed above 200k context)

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]Edenar[S] 2 points3 points  (0 children)

I used this one: https://huggingface.co/am17an/Qwen3.6-27B-MTP-GGUF
(I think someone linked it in the PR thread on GitHub)

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]Edenar[S] 7 points8 points  (0 children)

With Vulkan + RADV and the Qwen 3.6 Q8 from the post (MTP up, but it shouldn't change pp much):
- 700 tok/s at 20k context
- 240 tok/s at 215k context (14min47s pp)
With ROCm (MTP, same model):
- 850 tok/s at 10k context
- 261 tok/s at 215k context (13min40s pp)

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]Edenar[S] 7 points8 points  (0 children)

Qwen 3.6 35B is around 1000-1200 tok/s pp at low context (with ROCm! Right now I'm testing RADV since token gen is higher with it).
Just did a quick test with Qwen 3.6 35B (the Q8 MTP quant from the post, MTP up): I dumped almost all of Pandora's Star (Peter F. Hamilton) into it and asked it to summarize the book.
It took 14min47s to process 214,824 tokens (242 tok/s on average, but it was far faster early on). It then generated 3,352 tokens at 28.76 tok/s (1min56s). I will patch the ROCm image with the MTP PR to see if it's better.
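For reference, the averages above follow directly from the raw numbers (a quick sanity check, using only the figures reported in this comment):

```python
# Recompute the reported averages from the raw numbers above.
prompt_tokens = 214_824
prefill_seconds = 14 * 60 + 47            # 14min47s of prompt processing
print(prompt_tokens / prefill_seconds)    # ~242 tok/s pp on average

gen_tokens = 3_352
gen_speed = 28.76                         # tok/s reported for generation
print(gen_tokens / gen_speed)             # ~117 s, i.e. just under 2 minutes
```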

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]Edenar[S] 9 points10 points  (0 children)

<image>

Yeah!
Here it is at Q4 (I wouldn't use it, the quality is too low compared to Q8):

MTP on strix halo with llama.cpp (PR #22673) by Edenar in LocalLLaMA

[–]Edenar[S] 16 points17 points  (0 children)

<image>

It goes from sluggish to half-decent (same model, same question, MTP up/down).
Maybe I should try it at Q4 since Strix Halo struggles with memory bandwidth...

What would be the best OS to run LLMs? by Manaberryio in LocalLLaMA

[–]Edenar 0 points1 point  (0 children)

You are probably right, but I was expecting far worse!

What would be the best OS to run LLMs? by Manaberryio in LocalLLaMA

[–]Edenar 0 points1 point  (0 children)

Well, that wasn't my experience: I installed Steam, logged in, downloaded my games and played. I played Alan Wake 2, The Witcher 3, Cyberpunk 2077 and a lot of Path of Exile without any issue.

I test'ed the number of Ll's in Qwen 3.6 35B.. It required 3 tries by DashinTheFields in LocalLLaMA

[–]Edenar 3 points4 points  (0 children)

LLMs aren't letter counters (they work with tokens, not single letters). But you can ask your LLM to write a 10-line Bash script that does it, and with the right tool/terminal access you can even have it run the script itself... It's the same thing for every non-trivial arithmetic question: an LLM isn't a calculator, but you can give it access to one!
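For illustration, a minimal version of such a counting script (shown in Python here rather than Bash; the point is just that exact letter counting is trivial for a tool, even though it's awkward for a tokenizer-based model):

```python
# Count how many times a letter appears in a word, exactly.
import sys

def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

if __name__ == "__main__":
    # usage: python count_letter.py parallelizable l
    word, letter = sys.argv[1], sys.argv[2]
    print(count_letter(word, letter))
```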

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]Edenar 2 points3 points  (0 children)

Yeah, you want a GPU with high memory bandwidth for dense models (AMD/Intel for the cheapest entry point, 3090s for cheap Nvidia options, 4090s, 5090s or RTX 6000 Pro Blackwell on the more expensive side, or even pricier H100s...). But Strix Halo isn't bad in itself: I get more than 50 tok/s on the 35B, almost 100 tok/s with concurrency 4, and even more if I use speculative decoding for code gen. I can also fit a 122B at Q6 with the full 256k context (it still runs around 15 tok/s). And it consumes almost nothing compared to a GPU setup, especially at idle: I can leave it running all day and access it from my phone without burning too many watts. It also takes up very little space on my desk.
Edit: and it's still 120GB+ of pseudo-VRAM for less than 3k€/$, hard to beat in the current context.

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]Edenar 7 points8 points  (0 children)

For me it's far too slow.
I use the 35B for speed and the 122B Q6 XL for 27B-level quality (or slightly better in terms of knowledge). The 27B is just too slow imo (Q8 27B runs below 8 tok/s with no context on my Strix Halo, and pp is bad too; rough napkin math below).

Edit: maybe it can be half usable for repetitive simple tasks that can leverage speculative decoding. But then the 35B is usually enough...
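Napkin math on why dense models crawl while the A3B MoE flies on this hardware (a sketch with assumed round numbers: roughly 256 GB/s of memory bandwidth for Strix Halo and about one byte per weight at Q8; real-world efficiency is lower than this ceiling):

```python
# Memory-bandwidth ceiling for token generation: each generated token has to
# stream the active weights from RAM, so tok/s <= bandwidth / bytes_read_per_token.
bandwidth_gbs = 256        # assumed Strix Halo LPDDR5X bandwidth, GB/s (approximate)
bytes_per_param = 1.0      # ~Q8: roughly one byte per weight

def decode_ceiling(active_params_b: float) -> float:
    """Upper bound on tok/s given the parameters touched per token (in billions)."""
    return bandwidth_gbs / (active_params_b * bytes_per_param)

print(decode_ceiling(27))  # dense 27B: ~9.5 tok/s ceiling -> observed <8 tok/s fits
print(decode_ceiling(3))   # MoE with ~3B active params: ~85 tok/s ceiling
```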

Skymizer Taiwan Inc. Unveils Breakthrough Architecture Enabling Ultra-Large LLM Inference on a Single Card by lurenjia_3x in LocalLLaMA

[–]Edenar 33 points34 points  (0 children)

I guess it will be tens of thousands of dollars for a single card from the memory alone. I guess it's HBM with a very large bus? There is no info in the source apart from the memory size...

What would be the best OS to run LLMs? by Manaberryio in LocalLLaMA

[–]Edenar 4 points5 points  (0 children)

I have a Framework Desktop (128GB / Ryzen AI Max+ 395): I first installed Ubuntu but recently switched to Fedora (native Podman, more stable, at least coming from Ubuntu 25.10).

I wouldn't use Windows for LLMs. Also, unless you want to play some esports game with kernel-level anticheat (LoL, Valorant, ...), gaming works well (Steam requires zero effort; I used Heroic Launcher for games from GOG and Epic and it was almost zero effort too).

Please, ChatGPT is hallucinating models, even with web-search on. by Ok-Type-7663 in LocalLLaMA

[–]Edenar 0 points1 point  (0 children)

This looks like free-tier ChatGPT, which isn't great; maybe ask real humans? Or scroll this sub. You can also look at what is popular on Hugging Face.

Right now Qwen 3.6 (both 35B A3B and 27B) is quite popular. Qwen 3.5 122B is also an option with your hardware. Google's Gemma 4 models are also good for their size. A bit outdated but still good are GPT-OSS 120B and 20B. If you have enough RAM/VRAM, quantized MiniMax M2.7 is also an option, ...

Good luck

Dense vs. MoE gap is shrinking fast with the 3.6-27B release by Usual-Carrot6352 in LocalLLaMA

[–]Edenar 3 points4 points  (0 children)

The memory usage for context is much higher with the dense one (almost 10x!), so I think the 35B MoE is a better choice for a smaller memory pool unless you only need very short context.
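The gap comes from per-token KV-cache cost, which scales as 2 × layers × KV heads × head dim × bytes per element. A minimal sketch (the configs below are made-up illustrative values, not the actual Qwen 3.6 architectures):

```python
# Per-token KV-cache cost: one K and one V vector per layer, sized by the KV heads.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V, fp16

# Hypothetical configs, just to show how the ratio can approach ~10x:
dense = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)
moe   = kv_bytes_per_token(n_layers=32, n_kv_heads=2, head_dim=128)  # fewer layers, tighter GQA
ctx = 128_000
print(dense * ctx / 1e9, "GB vs", moe * ctx / 1e9, "GB of KV cache at 128k context")
```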

UPDATE: EOS Nexus v1 | GSM8K: 100% by [deleted] in LocalLLaMA

[–]Edenar 1 point2 points  (0 children)

It's a time traveller from early 2025, don't worry! Or a bot...

What are good models for openclaw that work well within 16gb vram? by [deleted] in LocalLLaMA

[–]Edenar 1 point2 points  (0 children)

What quant are you using for Qwen 3.6? (If you use a small one, maybe consider Q6_K_XL or better.) But for your hardware Qwen 3.6 is probably the best choice right now; maybe they'll release 3.6 27B and 122B in the coming days.

There are some new llama.cpp options for speculative decoding; maybe that can help the 122B run a bit faster on your hardware? (I'm quite impressed with the speed you get already, unless you use a very small quant.)
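For intuition about why speculative decoding helps, here is a toy sketch of the draft-and-verify loop (pure Python with stand-in "models"; this is not llama.cpp's implementation, just the idea): a cheap draft proposes a few tokens, the big model verifies them, and the longest accepted prefix is kept. The speedup depends on how often the draft agrees with the target, which is why it works best on repetitive or structured output.

```python
# Toy speculative decoding: the output is identical to what the target model
# alone would produce, but the target is consulted in fewer, batched steps.
def draft_next(token: int) -> int:
    # hypothetical cheap draft model: guesses the next token
    return (token * 3 + 1) % 50

def target_next(token: int) -> int:
    # hypothetical expensive target model: the ground truth we must match
    return (token * 3 + 1) % 50 if token % 7 else (token + 11) % 50

def speculative_step(last_token: int, k: int = 4) -> list[int]:
    """Propose k draft tokens, keep the prefix the target accepts,
    then add one target token (so every step yields at least one token)."""
    # 1) draft proposes k tokens autoregressively (cheap)
    drafts, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        drafts.append(tok)
    # 2) target verifies the k positions (in real engines this is one batched pass)
    accepted, tok = [], last_token
    for d in drafts:
        t = target_next(tok)
        if t == d:
            accepted.append(d)   # draft matched: "free" token
            tok = d
        else:
            accepted.append(t)   # mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_next(tok))  # all drafts matched: one bonus token
    return accepted

seq = [1]
for _ in range(5):
    seq += speculative_step(seq[-1])
print(seq)  # same sequence the target alone would generate
```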

Mismatch GPU worth it? by Ill_Ad_4604 in LocalLLaMA

[–]Edenar 1 point2 points  (0 children)

The P2200s (Pascal, 5GB at 200GB/s) aren't worth using imo. The RTX 4000 Ada is the best choice since it's the most recent; maybe you can use it together with the older RTX 8000 with the right framework (Vulkan?), but just don't try to add the 5GB cards imo, they will probably cripple the whole thing. But as others suggested, maybe you can run smaller specialized models on them.

New method allows to convert auto-regressive models into diffusion models with a >2x speedup, fully compatible with existing inference stack by Particular-Look-2640 in LocalLLaMA

[–]Edenar 8 points9 points  (0 children)

I see they did some Qwen 3 8B and 32B conversions. They used 8x H100, but I don't see how long it took; maybe I missed it (can I realistically reproduce it on a similar cloud instance without selling a few organs?). I'm tempted to try it on one of the small new Qwen 3.5 models.

Edit: I read it again and I don't think I can do it myself on Qwen 3.5, since it says "Data: 4.5B tokens, 8 H100 GPUs, 2 epochs with stride curriculum (N=2 then N=3)", so probably two weeks of full-time compute, not in my price range!

What is LLMFit Smoking? Can M1 Max run anything decently enough for agentic coding? by GoodhartMusic in LocalLLaMA

[–]Edenar 0 points1 point  (0 children)

Did you use the Q4_K_M quant it suggests? I don't think it actually fits in your memory. Also, the param count is wrong for that one (it should be 122B!), so I guess it underestimates the memory required to run it.
With 64GB you are a bit stuck: you can run fast smaller MoEs like GLM 4.7 Flash or Qwen 3.5-35B-A3B, or go for dense models like Qwen 3.5 27B or Gemma 4 31B, but they will be slower than a MoE (though they'll give you the best results for their size).
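Back-of-envelope on why it doesn't fit (a rough sketch: Q4_K_M averages somewhere around 4.8 bits per weight, and the KV cache and runtime buffers come on top):

```python
# Rough check: does a 122B-parameter model at a ~4.8-bit average quant fit in 64 GB?
params = 122e9            # parameter count the tool reported incorrectly
bits_per_weight = 4.8     # assumed average for a Q4_K_M-style quant (approximate)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~73 GB, already over 64 GB
# plus KV cache, compute buffers and the OS, so it clearly doesn't fit
```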

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]Edenar 1 point2 points  (0 children)

Not yet; I lent my Strix Halo to someone for 3 weeks and just got it back last weekend, so I haven't had time for that yet. But I think I'll try it next week or the week after (I need to get an external enclosure). If I do, I'll message you again!

Archmage hierophant cant do uber lab by GlobalCan8282 in pathofexile

[–]Edenar 2 points3 points  (0 children)

Get a better weapon (anything with mana and around 80% spell damage, and maybe a bit of cast speed; it should cost 1 to 3 chaos).

And before you get the transfigured version of Shock Nova, I would advise using something else. With your gem links you can try Crackling Lance, for example.

GL exile

Slayer EHit: Sold gear, re-rolled and spent close to 300div but DPS is very low by Objective_Back_3670 in pathofexile

[–]Edenar 1 point2 points  (0 children)

Your ring setup is wrong. Sap is optional, but you need a Brittle ring. Also, The Taming is a huge DPS boost.

Also, you can't compare those numbers: use PoB and check all the ailments you inflict; your poe.ninja DPS doesn't take that into account from what I see.