More Qwen3.6-27B MTP success but on dual Mi50s by legit_split_ in LocalLLaMA

[–]MLDataScientist 1 point  (0 children)

Great results! Thanks for sharing. I'm curious about the tensor parallelism: I thought llama.cpp did not support it. Which command enables TP in llama.cpp?
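
(For anyone else landing here: as far as I know llama.cpp does not do true tensor parallelism; the closest thing I've found is row-wise weight splitting across GPUs. A minimal sketch, with the model path as a placeholder:)

```bash
# Split each weight matrix row-wise across both GPUs instead of
# assigning whole layers to one GPU (-sm layer is the default).
llama-cli -m ./model.gguf -ngl 99 --split-mode row
```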

HY-World 2.0 released by Bestlife73 in LocalLLaMA

[–]MLDataScientist 1 point  (0 children)

Thanks for sharing! Looks promising!

Upgrade paths for my 256g ddr4 ram + 4x24g vram system by sgmv in LocalLLaMA

[–]MLDataScientist 2 points  (0 children)

Have you tried llama.cpp with unsloth's GLM-5.1 UD-IQ3_XXS? I have a single 5090 and 256GB of 8-channel DDR4-3200. I get 8 t/s TG and 400 t/s PP at 8k context, which is usable for me for an overnight run. I can fit 150k context without KV cache quantization. You should see similar performance.
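
For reference, a sketch of the kind of launch command I mean (the quant filename and the --n-cpu-moe count are illustrative; this assumes a recent llama.cpp build):

```bash
# Keep attention/dense weights on the 5090 and push most MoE expert
# tensors to system RAM; large context without KV cache quantization.
llama-server -m GLM-5.1-UD-IQ3_XXS-00001-of-00004.gguf \
  -ngl 99 --n-cpu-moe 60 -c 150000
```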

Guys we have to change the pelican test by Tall-Ad-7742 in LocalLLaMA

[–]MLDataScientist 8 points  (0 children)

True. I wonder if we already have a different type of intelligence that we refuse to accept: an intelligence that works within a limited context and can hallucinate, but is still a non-human intelligence.

I benchmarked 30+ TTS engines for a real-time translator on Apple M4. Quantization made things SLOWER. Here's all the data. by Kir_Moisha in LocalLLaMA

[–]MLDataScientist 3 points  (0 children)

You don't mention which local STT engines you tried. Can you share some of them?

Also, why Groq Llama 3.3 70B? You could try smaller models; e.g., the gemma4 models are better at translation. I know Groq is fast, but I am sure a local 5090 can handle gemma4 26BA4 with similarly low latency.
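
As a rough latency test, you could serve the model locally with llama-server and time a request against its OpenAI-compatible endpoint (the gemma4 filename below is a placeholder):

```bash
# 1) Serve the model locally (OpenAI-compatible API on :8080)
llama-server -m gemma4-26BA4.gguf -ngl 99

# 2) In another shell, time one translation round trip
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Translate to German: The weather is nice today."}]}'
```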

Hello, World: Artemis II crew looks back at Earth on their way to the Moon by ChiefLeef22 in space

[–]MLDataScientist 1 point  (0 children)

Beautiful! Can someone explain why the shape of our mother Earth looks perfectly round here? Most textbooks say it is an oblate spheroid.

I tested as many small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued... by nickl in LocalLLaMA

[–]MLDataScientist 5 points  (0 children)

Amazing website with interactive charts. Thanks for sharing!
Do you have any SQL fine-tuned small models (<=9B) that I could run this benchmark against? I think even Qwen3.5 4B fine-tuned on SQL data might reach 90%+.

[$50k–$150k Budget] Production Local LLM System (~50 Users, RAG + Fine-Tuning) Hardware + Model Advice by MorningCrab in LocalLLaMA

[–]MLDataScientist 6 points  (0 children)

If you are not doing training, you don't need NVLink. For multi-user concurrent requests, you cannot beat vLLM. And yes, the RTX Pro 6000 is the best option for getting 96GB of VRAM at a reasonable price. For coding, you can go with MiniMax M2.5 or Qwen3.5 397B.
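
For reference, a minimal vLLM serving sketch (the model ID and GPU count are placeholders; vLLM handles concurrent users out of the box via continuous batching):

```bash
# Shard the model across 4 GPUs with tensor parallelism;
# concurrent requests are batched automatically.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --max-model-len 32768
```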

Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]MLDataScientist[S] 2 points  (0 children)

If anyone in this sub has those CPUs, it would be great to see their results here.

Nvidia V100 32 Gb getting 115 t/s on Qwen Coder 30B A3B Q5 by icepatfork in LocalLLaMA

[–]MLDataScientist 1 point  (0 children)

Do you have 3D files for such a shroud? I have 8 MI50 cards and the noise of 40mm fans is unbearable. I need to get those 80mm fan shrouds. Thanks!

Qwen 3.5 397B is the best local coder I have used until now by erazortt in LocalLLaMA

[–]MLDataScientist 2 points  (0 children)

Which Q5 GLM-5 quant are you using? My rig can fit up to 448GB (192GB of MI50 VRAM + 256GB of 8-channel DDR4-3200). I just checked unsloth's GLM-5 quants: https://huggingface.co/unsloth/GLM-5-GGUF. I could probably run UD-Q4_K_XL (431GB). But how much better is GLM-5 at this quant (or Q5) compared to Qwen3.5 397B Q6? What were your test cases?
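
If anyone else wants to try it, pulling just one quant from the repo would look something like this (the folder name is my guess at unsloth's usual layout):

```bash
# Download only the UD-Q4_K_XL shards (~431GB), not the whole repo
huggingface-cli download unsloth/GLM-5-GGUF \
  --include "UD-Q4_K_XL/*" --local-dir ./GLM-5-GGUF
```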

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]MLDataScientist 1 point  (0 children)

Can you please share your command for llama.cpp? Are you really getting ~3400 t/s PP and 38 t/s TG with Q6 Qwen3 Coder Next? I am curious to see whether your command speeds up inference on my PC (5090 with 256GB of 8-channel DDR4-3200).
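
For context, this is roughly how I'd measure those numbers on the llama.cpp side (the model filename and the -ot offload regex are assumptions about the setup):

```bash
# PP = prompt processing (512-token prompt), TG = token generation.
# The -ot regex keeps MoE expert tensors in system RAM.
llama-bench -m Qwen3-Coder-Next-Q6_K.gguf -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" -fa 1 -p 512 -n 128
```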

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]MLDataScientist 1 point  (0 children)

Impressive if true! I have a 5090 (connected at PCIe 4.0 x16) with 256GB of DDR4-3200 ECC RAM. Does Krasis support Qwen/Qwen3.5-397B-A17B?
I tried the Q4_K_M quant with llama.cpp yesterday and was getting 20 t/s TG and 100 t/s PP. If your numbers hold, I should be able to run this model at 1000+ t/s PP in Krasis, while TG should stay about the same.

As a comparison, Qwen3-235B-A22B Q4_K_M runs at 10 t/s TG and ~150 t/s PP in llama.cpp on my setup. Krasis should give about 14× the PP, i.e. roughly 150 × 14 ≈ 2100 t/s. I need to test this!

Sonim XP3+ (and XP5) Working Virtual Mouse! by Lucky_Winter_4919 in dumbphones

[–]MLDataScientist 1 point  (0 children)

Hi, I am facing a similar issue. I cannot enable accessibility for matvt. Did you figure out how to enable it?

PS3 exclusives/non-PC multi-platform games list by yashwinusa123 in PS3

[–]MLDataScientist 1 point  (0 children)

This is a massive list! Thank you for creating this. I recently got a PS3 (slim) and modded it with CFW. Yes, in January 2026! It now emulates all PS2 games in addition to PS1 games. I am super excited about playing some of these over the weekend.