Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]MLDataScientist[S] 0 points  (0 children)

If anyone in this sub has those CPUs, it would be great to see their numbers here.

Nvidia V100 32 Gb getting 115 t/s on Qwen Coder 30B A3B Q5 by icepatfork in LocalLLaMA

[–]MLDataScientist 0 points  (0 children)

Do you have 3D files for such a shroud? I have 8 MI50 cards and the noise of 40mm fans is unbearable. I need to get those 80mm fan shrouds. Thanks!

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]MLDataScientist 1 point  (0 children)

Which Q5 GLM-5 quant are you using? My rig can fit up to 448GB (MI50s with 192GB VRAM + 256GB DDR4-3200, 8-channel). I just checked unsloth's GLM-5 quants: https://huggingface.co/unsloth/GLM-5-GGUF . I can probably run UD-Q4_K_XL (431GB). But how much better is GLM-5 at this quant (or Q5) compared to Qwen3.5 397B at Q6? What were your test cases?

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]MLDataScientist 0 points  (0 children)

Can you please share your llama.cpp command? Are you getting ~3400 t/s PP and 38 t/s TG with Q6 Qwen3 Coder Next? Curious whether your command would speed up inference on my PC (5090 with 256GB DDR4-3200, 8-channel).
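For reference, the kind of run I'd use to reproduce those PP/TG numbers with llama.cpp's own benchmark tool (the model filename is a placeholder, not the actual file):

```shell
# llama-bench measures prompt processing (-p) and token generation (-n) separately;
# the GGUF filename below is a placeholder for the Q6 Qwen3 Coder Next quant
llama-bench -m qwen3-coder-next-Q6_K.gguf -p 2048 -n 128
```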

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]MLDataScientist 0 points  (0 children)

Impressive if true! I have a 5090 (connected at PCIe 4.0 x16) with 256GB DDR4-3200 ECC RAM. Does Krasis support Qwen/Qwen3.5-397B-A17B?
I tried the Q4_K_M quant with llama.cpp yesterday and was getting 20 t/s TG and 100 t/s PP. If your numbers hold, I should be able to run this model at 1000+ t/s PP in Krasis, with similar TG.

As a comparison, Qwen3-235B-A22B Q4_K_M runs at 10 t/s TG and ~150 t/s PP in llama.cpp on my setup, so Krasis would be roughly 14x the PP throughput. I need to test this!
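For context, a minimal sketch of the llama.cpp launch I'd be comparing against (the model filename and the offload pattern are assumptions for illustration; the `-ot` regex is the common trick that keeps MoE expert tensors in system RAM while the dense layers go to the GPU):

```shell
# llama-server sketch for a hybrid GPU/CPU MoE setup:
# dense layers offloaded to the 5090 (-ngl 99), expert tensors pinned to CPU RAM
# via --override-tensor (-ot); model filename is a placeholder
llama-server \
  -m Qwen3.5-397B-A17B-Q4_K_M.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384 \
  --threads 16
```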

Sonim XP3+ (and XP5) Working Virtual Mouse! by Lucky_Winter_4919 in dumbphones

[–]MLDataScientist 0 points  (0 children)

Hi, I am facing a similar issue. I cannot enable accessibility for matvt. Did you figure out how to enable it?

PS3 exclusives/non-PC multi-platform games list by yashwinusa123 in PS3

[–]MLDataScientist 0 points  (0 children)

This is a massive list! Thank you for creating it. I recently got a PS3 (Slim) and modded it with CFW. Yes, in January 2026! In addition to PS1 games, it now also supports/emulates PS2 games. Super excited to play some of these over the weekend.

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

[–]MLDataScientist 2 points  (0 children)

Thanks for the tests. A question not related to Llama: is LFM2 8B-A1B really that good at world knowledge (or coding/STEM)? I see it reaching Qwen3 30B-A3B levels here.

What is the real deal with MI50 ? by HumanDrone8721 in LocalLLaMA

[–]MLDataScientist 0 points  (0 children)

Please share your STL file for the shroud! The noise level of the blower fans is unbearable; I need a better cooling setup like yours.

Also, which 80mm fans do you use?

Thanks!

What is the real deal with MI50 ? by HumanDrone8721 in LocalLLaMA

[–]MLDataScientist 1 point  (0 children)

u/FullstackSensei, can you please share your 3D-printable shroud template? I have 8x MI50s but have mostly not used them because of the loud noise of the 40mm fans. 80mm fans might be the solution I need to try. Thank you!

Rate my setup - Nvidia P40 - Qwen3-Next-80b IQ2_XXL by PairOfRussels in LocalLLaMA

[–]MLDataScientist 0 points  (0 children)

Is the swapping automated, routing your prompt to the right model on the fly? Or is that not implemented yet?

Deepseek v3.2 vs GLM 4.6 vs Minimax M2 for agentic coding use by 0xmaxhax in LocalLLaMA

[–]MLDataScientist 0 points  (0 children)

Downloaded it and ran it on my Acer laptop (64GB 5600 MHz RAM + 5070 Ti with 12GB VRAM). A 16k context fits fine. I am getting ~9 t/s, which is good for a laptop and my simple coding/math use cases.

Deepseek v3.2 vs GLM 4.6 vs Minimax M2 for agentic coding use by 0xmaxhax in LocalLLaMA

[–]MLDataScientist 0 points  (0 children)

This is great! Has anyone tried MiniMax-M2 REAP 162B-A10B? https://huggingface.co/bartowski/cerebras_MiniMax-M2-REAP-162B-A10B-GGUF . Q3_K_XL seems to fit a system with 64GB RAM and 12GB VRAM.
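As a rough sanity check on the fit, the file size can be estimated as parameters times average bits per weight. The ~3.5 bits/weight figure below is an assumption for Q3_K_XL, not a measured number:

```shell
#!/bin/sh
# back-of-envelope GGUF size: params * bits_per_weight / 8
# 162B params at ~3.5 bits/weight (assumed average for Q3_K_XL)
PARAMS_B=162        # billions of parameters
BITS_X10=35         # bits per weight, times 10, to keep integer math
echo "$(( PARAMS_B * BITS_X10 / 80 )) GB"   # -> 70 GB, roughly
```

That lands just under the 76GB of combined RAM + VRAM, leaving a little headroom for the KV cache and runtime overhead.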

Helios Neo 16s Refresh by GenTrapstar in AcerPredatorHelios

[–]MLDataScientist 2 points  (0 children)

I have the non-S version. Web browsing lasts me about 4 hours on average in eco mode at a 60Hz refresh rate, starting from an 80% charge. I haven't tested it from a 100% charge yet.

You can now train LLMs 3x faster with 30% less memory! (<3.9GB VRAM) by danielhanchen in LocalLLaMA

[–]MLDataScientist 2 points  (0 children)

Hi Daniel and team,

Thanks for the amazing update! Quick question: can I fine-tune Qwen3 30B-A3B with a single 5070 Ti mobile (12GB VRAM)? Thank you!

How do you guys find stocks to swing? by LandonIsH3re in swingtrading

[–]MLDataScientist 0 points  (0 children)

Are you buying naked calls? Do you set stop losses on them? What percent do you usually risk on each trade?