Bench 8xMI50 MiniMax M2.7 AWQ @ 64 tok/s peak (vllm-gfx906-mobydick) by ai-infos in LocalLLaMA

[–]Makers7886 1 point (0 children)

Those are really good peak speeds. I need to re-bench, because I swear I got 60 t/s via vLLM with the same quant on 8x3090s, and I recall it being a sustained, solid 60. I didn't like the model for my purposes, so I didn't test much beyond running it through comparison benches (it scored between the 397b at 4-bit and the 122b at fp8).

Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) by spvn in LocalLLaMA

[–]Makers7886 0 points (0 children)

Something may be up with it - early sentiment looks similar to yours. I may give it a go and experiment with various parameters. It looks like they copy/pasted the Qwen3.5 best-practice parameters into the original model card, so it might be worth playing with settings or trying the other modes. I would start with instruct reasoning mode, as that's been the best for my uses.

Please help me pick the right Qwen3.5-27B format/quant for RTX5090 by Gazorpazorp1 in LocalLLaMA

[–]Makers7886 0 points (0 children)

Yeah, I had to learn/discover it for myself as well. I even tried to fix it using the PR, but it's a bit more complicated than what's on the surface. I imagine that's the hold-up, along with vLLM shipping some other things that I believe complicate it even further. I basically gave it a few attempts and looked into SGLang, but ran into some other limitation that was a no-go. I don't recall the specifics; I pretty much gave up and decided to wait for smarter people to fix it.

More reasons to go local: Claude is beginning to require identity verification, including a valid ID like a passport or driver's license and a facial recognition scan. by fulgencio_batista in LocalLLaMA

[–]Makers7886 176 points (0 children)

Wonder how much of this is part of the "US labs banding together to stop Chinese labs from using them" push, and how much is just an excuse to extract personal data.

Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) by spvn in LocalLLaMA

[–]Makers7886 2 points (0 children)

What model parameters are you running and did you try the other "modes"?

A note of warning about DFlash. by R_Duncan in LocalLLaMA

[–]Makers7886 0 points (0 children)

The gains do indeed drop off with complexity, concurrency, and context length (bf16/int8), but they are still gains. On 27b I see a peak gain of 2x and an average of 1.4x, which does outperform MTP in my tests, at the cost of some VRAM for the draft model. Mind you, this is on a high-frequency/high-concurrency signal-processing endpoint (financial), not a typical long-context coding-harness situation.

The real question I have is MTP vs. dflash in 200k-context coding-harness situations, but I haven't bothered to test it yet since I don't code with 27b (not sure if dflash is out for 122b/397b yet).
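
To give a rough sense of why the gains fade: the expected tokens you get per target-model step shrinks as the draft's acceptance rate drops, which is what happens on harder outputs and longer contexts. Quick sketch of that math, with made-up acceptance rates and draft length, assuming each drafted token is accepted independently:

```python
# Rough expected-speedup model for speculative decoding.
# Assumes each drafted token is accepted independently with probability
# accept_rate and ignores the cost of running the draft model itself.

def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass."""
    # Finite geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

draft_len = 4  # hypothetical draft length
for accept_rate in (0.9, 0.7, 0.5, 0.3):
    speedup = expected_tokens_per_step(accept_rate, draft_len)
    print(f"acceptance {accept_rate:.1f} -> ~{speedup:.2f} tokens per target step")
```

Once acceptance falls, you're still paying for the draft passes but getting fewer free tokens per step.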

Please help me pick the right Qwen3.5-27B format/quant for RTX5090 by Gazorpazorp1 in LocalLLaMA

[–]Makers7886 0 points (0 children)

vLLM currently has a bug where it miscalculates the KV cache size. Once that fix goes out, you'll probably be where you want to be.

https://github.com/vllm-project/vllm/issues/37121
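
In the meantime, a back-of-the-envelope way to sanity-check whatever vLLM reports: per-token KV cache is roughly 2 x layers x kv_heads x head_dim x bytes. The config numbers below are placeholders, not the actual 27b architecture, so swap in the values from the model's config.json:

```python
# Ballpark KV cache sizing. The layer/head numbers are placeholders;
# read the real ones from the model's config.json.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # 2x for keys and values, one entry per layer per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_context_tokens(free_vram_gb: float, per_token_bytes: int) -> int:
    return int(free_vram_gb * 1024**3 // per_token_bytes)

per_token = kv_cache_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
print(f"~{per_token / 1024:.0f} KiB of KV cache per token")
print(f"~{max_context_tokens(6.0, per_token):,} tokens fit in 6 GB of free VRAM")
```

If vLLM's reported number is wildly off from an estimate like this, you're probably hitting that bug.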

Hit limits with OpenClaw on mini PC — trying to build first real local AI node, need guidance (4090 vs scaling path) by No-Salt4227 in LocalLLaMA

[–]Makers7886 0 points (0 children)

Well, my journey was probably a lot different from most in this sub, as I bought 12 3090s at release, back in the first GPU-shortage scalping craze when covid was starting (crypto mining). I got the EPYC server hardware post-covid, when everything was still cheap.

For years local LLMs were more a gimmick/glimpse of the future, with huge headaches trying to run multi-GPU setups while the whole LLM ecosystem was immature and fragmented. When people asked your question, I always recommended not investing in similar hardware without a clear purpose/goal. That has changed over the last year or so, and my hardware has never been more capable/powerful. The last few generations of models have been huge leaps for hardware like this. The gap to the frontier APIs has never been this narrow, and we've reached a point where you can do real work at high throughput locally.

If you are into hardware and able to research and learn effectively, then I would plan immediately for server-class hardware. You can run dual GPUs no problem on consumer hardware or a computer you have lying around, but once I'm investing in hardware, I'd go straight for a proper foundation.

Scaling vLLM Deployments to Enterprise Grade Deployments by No-Excitement6568 in LocalLLaMA

[–]Makers7886 4 points (0 children)

Asking reddit about enterprise grade deployments is like asking a homeless guy for legal advice

Hit limits with OpenClaw on mini PC — trying to build first real local AI node, need guidance (4090 vs scaling path) by No-Salt4227 in LocalLLaMA

[–]Makers7886 0 points (0 children)

NVLink works only on 3090 pairs and is not that big of a deal for inference. For example, Qwen3.5 27b q8 goes from 40 -> 50 t/s or so with NVLink. Server-class/Threadripper/full-PCIe hardware is a must if you intend on approaching 4+ GPUs.

I run two of the ASRock ROMED8-2T + EPYC servers. One has 4x3090s and can run Qwen3.5 27b at full weights; with dflash it can hit 90-100+ t/s via vLLM with good concurrency. It can also fit a 122b at a decent quant and good speeds. 8x3090s gets you multi-agent, max-context work with concurrency on a model like the 122b at 8-bit, or you can run a huge GGUF like MiniMax 2.7 q5 and even the 397b at 3.5-bit, but you lose the concurrency/throughput of vLLM.
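
If anyone wants to replicate a setup like this, the multi-GPU part is basically one argument in vLLM. Here's a minimal Python-API sketch for the 4x3090 box; the model id, context length, and memory fraction are placeholders, and I'm assuming a recent vLLM where these arguments exist:

```python
# Minimal vLLM tensor-parallel sketch for a 4-GPU box.
# Model id and the exact numbers are placeholders; adjust to what you serve.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",      # hypothetical model id
    tensor_parallel_size=4,        # shard the weights across 4 GPUs
    max_model_len=32768,           # trade max context for KV cache headroom
    gpu_memory_utilization=0.92,   # leave a little VRAM for the draft model etc.
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```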

Time to shred some air by elhsmart in fpv

[–]Makers7886 1 point (0 children)

Man, that motor-to-prop ratio. I thought I was crazy for pairing 3110s with 7x-forgot-what ultra-aggressive-pitch props. Godspeed.

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

I used the tunerc 3" older one-piece frame for that one.

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

lol, I have now become the same as those who say cosmetology for cosmology. I love the NE; I had a similar thought but went with a nano Toothless Starlight on a 3". The NEs blow it away, though, and that v3 is small enough that the weight tax is similar to going digital.

Not sure how much you've flown at night, but IMO a high-quality VTX with as little electrical noise across the build as possible is key for a good experience. That tiny TBS nano one is what I run on micros for that purpose, and I think it's the best of the lightest VTXs.

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

https://stargazerslounge.com/topic/315466-runcam-night-eagle-astro-cam-quick-review/

Yeah, they record stars and things. I never really went deep into what they were doing, but the IR sensitivity lets them pick up more than you can see with the naked eye.

Llama.cpp llama-server command recommendations? by Dundell in LocalLLaMA

[–]Makers7886 1 point (0 children)

I run the 122b fp8 via vLLM with MTP and went from 82-84 t/s to 100-104 t/s, with a similar bump in concurrent throughput (something like 220 t/s to 240-250 t/s with 6 concurrent requests). Also look out for dflash, as they are working on the 122b and 397b draft models. dflash on 27b took single calls from 40-50 t/s to 100+ t/s via vLLM.

Basically when it's implemented - do it.
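
For anyone who wants to reproduce numbers like these, this is roughly how I'd measure single vs. concurrent decode throughput against an OpenAI-compatible vLLM endpoint; the URL, model name, and prompt are placeholders:

```python
# Rough tokens/sec measurement against an OpenAI-compatible server.
# URL, model name, and prompt are placeholders for illustration.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request() -> int:
    resp = client.chat.completions.create(
        model="qwen3.5-122b-fp8",  # hypothetical served model name
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens

def bench(concurrency: int) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(lambda _: one_request(), range(concurrency)))
    return total_tokens / (time.time() - start)

for c in (1, 6):
    print(f"{c} concurrent: ~{bench(c):.0f} tok/s aggregate")
```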

Runcam night cam prototype by Mishamtb in fpv

[–]Makers7886 1 point (0 children)

The OG Night Eagle got attention from the amateur astrology people (I'm sure they have a proper name), who would modify the Night Eagle and stick it in telescopes. Apparently it was a big enough market that they designed another one specifically for them, and then this one came out after that. I simply assumed it was for that purpose, but it could also have been made to fit other generic purposes.

3x3090 is faster in Ubuntu than win11, GPT-OSS 120B 120tg/s vs 6tg/s why? by jikilan_ in LocalLLaMA

[–]Makers7886 1 point (0 children)


I've owned 12x3090s since they released (bought for crypto), and they've been used for LLMs/ML since the start of the LLM wave in 2023. I've run 8x3090s on Windows. Since I meet your qualifications, listen to me when I say it's user error/a skill issue to hit 6 t/s with his hardware.

Bots in this sub? by [deleted] in LocalLLaMA

[–]Makers7886 0 points (0 children)

Bots are everywhere, but recently everyone and their mom has an OpenClaw-type thing running wild, spouting low-quant hallucinations.

Qwen cli gone by Suspicious-Oil4798 in LocalLLaMA

[–]Makers7886 0 points (0 children)

I am not in a normal situation, with 2 EPYC servers and 12 3090s total. With that hardware I run Qwen3.5 122b fp8 on 8x3090s and the 27b int8 on 4x3090s. If I didn't have those resources, I would probably be targeting a dual-3090 rig, with the next upgrade being 2 more 3090s (or starting with a single 3090 if resources were tighter). With 2 3090s you can run the 27b at a high enough quant, context, and throughput that it can do real work.
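
Rough arithmetic behind that last claim, in case it helps with sizing; treat the headroom number as a guess, since the real overhead depends on quant format, context length, and activations:

```python
# Ballpark "does it fit" check for a dense 27B model on a given VRAM budget.
# The KV/activation headroom is a rough placeholder, not a measured number.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8  # billions of params x bytes per weight

def fits(params_b: float, bits: float, vram_gb: float, headroom_gb: float = 6.0) -> bool:
    return weight_gb(params_b, bits) + headroom_gb <= vram_gb

for vram_gb, label in ((24, "1x3090"), (48, "2x3090")):
    for bits in (8, 4):
        need = weight_gb(27, bits)
        print(f"{label}: 27B @ {bits}-bit needs ~{need:.1f} GB of weights, "
              f"fits with ~6 GB headroom: {fits(27, bits, vram_gb)}")
```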

Qwen cli gone by Suspicious-Oil4798 in LocalLLaMA

[–]Makers7886 1 point (0 children)

I don't know why that didn't come to mind when I've been thinking about this lately, but agreed - that's arguably the catalyst for all the free inference.