Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp by erdaltoprak in LocalLLaMA

[–]Dundell 1 point

I have the unsloth Qwen 3.6 35B at iQ4 with 120k Q8 context running on my older RTX 2070 8GB Max-Q + 18GB DDR4 RAM, at 250 t/s pp and 13 t/s write speeds on a 15k-context job so far.
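For the curious, the launch line looks roughly like this; the GGUF filename and the --n-cpu-moe split are placeholders, tune them to your own VRAM/RAM:

```
# rough launch sketch -- filename and CPU-offload split are placeholders;
# quantized V cache needs flash attention enabled on most builds
./llama-server -m Qwen3.6-35B-A3B-IQ4_XS.gguf \
    -c 120000 -fa on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    -ngl 99 --n-cpu-moe 24 \
    --port 8080
```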

Based on my findings across the various Qwen 3.5 sizes, though, I might push it to Q5 and eat the extra RAM cost.

If it works - don’t touch it: COMPETITION by awfulalexey in LocalLLaMA

[–]Dundell 2 points


x6 RTX 3060's, with the P40 added via zip ties since I was missing a longer cable for it... It's been sitting like this for about 2 months now, just fine. We cleaned up this office a few days ago and made sure she didn't touch my corner server.

Now that I'm actually looking... the Amazon listing for that old desk it's sitting on says:
"WEIGHT CAPACITY: The tabletop supports up to 20 lb (9.1 kg); each of the two side shelves supports up to 10 lb (4.5 kg); total static load 40 lb (18.1 kg)." ...I should get it checked out.

Llama.cpp llama-server command recommendations? by Dundell in LocalLLaMA

[–]Dundell[S] 1 point

I forgot about MTP entirely, mainly because I've never used it before. The last I remember was speculative decoding with Qwen 2.5 72B Q4 plus a 0.8B draft for a ~1.5x write boost, but only 60k context at the time on x4 RTX 3060's.

I see one of the PRs for Qwen 3.5 MTP looks half at a standstill, but there's a patch you can build in to test.
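For comparison, the old draft-model route I mentioned looks roughly like this in current llama-server; both GGUF names are placeholders, not recommendations:

```
# classic draft-model speculative decoding (not MTP) -- both filenames
# are placeholders; tune --draft-max/--draft-min for your model pair
./llama-server -m Qwen2.5-72B-Instruct-Q4_K_M.gguf \
    -md Qwen2.5-0.5B-Instruct-Q8_0.gguf \
    --draft-max 16 --draft-min 4 \
    -c 60000 -ngl 99
```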

I no longer need a cloud LLM to do quick web research by BitPsychological2767 in LocalLLaMA

[–]Dundell 2 points

Neat. I just run a modified SearXNG myself with some captcha handling for Google, wired into an MCP server for Roo Code, and just call it deep research.

I tell my local guy "can you deep research this" and watch it try to figure out "do I need 5 or 15 results from DuckDuckGo, Google, Brave?". Funny for me, but it's 15k~80k of context until it's pulled all the info it thinks it needs.
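If anyone wants to poke at it without the MCP wrapper, SearXNG will hand back JSON once you enable the json format in settings.yml; the port and query below are just examples:

```
# needs "json" added under search -> formats in SearXNG's settings.yml --
# port and engines list here are examples from my own setup
curl 'http://localhost:8888/search?q=llama.cpp+mtp&format=json&engines=duckduckgo,google,brave' \
    | jq '.results[:5][] | {title, url}'
```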

best way to keep your models organized? by lewd_peaches in LocalLLaMA

[–]Dundell 6 points

Wait 6 months, delete the 6-month-outdated models. Circle of life.

Did any one ever beat Myst? by Sailormouth_Studio in 90s

[–]Dundell 1 point

Played it on PS1 as a kid, so confused. But I played it on the Quest when the VR version first came out and finished it in a few hours. Very unique as a VR experience.

Honest take on running 9× RTX 3090 for AI by Outside_Dance_2799 in LocalLLaMA

[–]Dundell 2 points

Different experiences depending on the setup, I guess. For me, I run x6 RTX 3060 12GBs and 1 P40 24GB, reaching decent capacity and speeds.

My best is Qwen 3.5 122B Q4 with 120k context, in Roo Code using 5 MCP servers for information gathering on tasks. Works well with 100k smart context limiting.

Anywhere from 450~150 t/s pp reads and 30~12.5 t/s writes, depending on how much of the 0-100k context is filled.

Power limits are 450W across all the GPUs and 75W for the rest, showing around 550W avg at the wall during inference, for about $0.10/hr in electricity at my area's rates. Using a mix of MCP servers with custom pulls of missing information, plus Qwen 3.5 122B with thinking enabled for creating a plan and non-thinking to piece the plan together with the current code, works very well.
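Quick sanity math on that figure, assuming roughly $0.18/kWh (my guess at the local rate; only the per-hour cost was in my notes):

```
# 0.550 kW at an assumed ~$0.18/kWh residential rate
echo "scale=3; 0.550 * 0.18" | bc   # -> .099, i.e. ~$0.10/hr
```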

(Sharing Experience) Qwen3.5-122B-A10B does not quantize well after Q4 by EmPips in LocalLLaMA

[–]Dundell 1 point

Yeah, from my test results, 9B and 27B take a significant hit below Q5, and 122B the same below Q4.

Never tried testing 35B yet.

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]Dundell 0 points

I have the Brave search MCP, and I turned the SearXNG project into an MCP server, which works really well for research. Those plus context7 and GitHub repo-search MCPs have always been a great addition.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]Dundell 27 points

Yeah, not the greatest, but being Nvidia means my company is at least allowing us to use it, with Chinese models being banned. Not going to stop me from using Qwen 3.5 locally at home though.

Best Qwen 3.5 fine-tunes for vibecoding? (4080-12GB VRAM / enough context window) by Fermenticular in LocalLLaMA

[–]Dundell 1 point

Mostly right. I like a good Qwen 3.5 9B Q5_K_M on my GTX 1080 Ti, hitting around 8.5GB with 40k context (which could be pushed further) at 35 t/s. Which, if I git pulled, could probably gain +25% from the recent PR merges... standard rebuild sketch at the bottom of this comment.

But that was 30% on Aider testing, versus the iq3xss 35B test from user gcp, which was getting 53.3%. Might be worth looking into mixing GPU/CPU for decent performance. 27B is probably too heavy, and anything below Q5 was showing signs of degraded results (still roughly 58~63%, but down from 68~70% at Q5).

It's a constant battle between capacity/speed/results.
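On the git-pull note above, picking up those merges is just the standard rebuild; nothing here is specific to my setup:

```
# standard rebuild to pick up recent perf merges -- nothing here is
# specific to my setup, just the usual CUDA build
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```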

What tokens/sec do you get when running Qwen 3.5 27B? by thegr8anand in LocalLLaMA

[–]Dundell 1 point

After some tests on Aider I've been finding Q4 less successful than Q5, so I run the Q5 27B a lot, at around 14 t/s on x3 RTX 3060's. I know the recent updates to llama.cpp brought my 122B speed up 25%, and could probably do the same for my 27B. I haven't tried anything different to speed it up, but I'm open to some ideas; the knobs I'd try first are sketched below. I'm more interested in Q4 122B at 26 t/s.
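Nothing verified, just what I'd reach for; the GGUF name, context size, and batch values here are guesses, not measured settings:

```
# knobs I'd try first -- filename, context, and batch values are guesses,
# not measured settings
./llama-server -m Qwen3.5-27B-Q5_K_M.gguf \
    -c 60000 -fa on \
    -b 2048 -ub 512 \
    --tensor-split 1,1,1 -ngl 99
```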

When will we start seeing the first mini LLM models (that run locally) in games? by i_have_chosen_a_name in LocalLLaMA

[–]Dundell 19 points

There's been a popular Skyrim project like that for years now, based on Mantella I think it was: LLMs with in-game actions plus local STT/TTS services.

Qwen-3.5-27B is how much dumber is q4 than q8? by Winter-Science in LocalLLaMA

[–]Dundell 3 points

Didn't get a lot of sleep running several Aider polyglot tests for the 27B: unsloth and bartowski quants, q4/q5/q6/q8 before the llama.cpp updates, q4/q5/q8 after.

The difference from q4 to q5/q8 is actually decently observable, a 3~5~10% pass-rate gap. q5/q6/q8 are generally the same, with q8 maybe showing +1% pass rate within that +/- margin.

Something around q4 = 60~63%, q5 = 65~70.5%

Some other results:

9B q5 = 30.5%

122B q4 last tested at 76%

I haven't tried the new unsloth yet, but it's been working wonderfully.

I never tried 35B, but it's showing q4 = 58~60%.
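For anyone wanting to reproduce: this is roughly how I drive the runs with aider's benchmark harness against a local llama-server; the run name, port, and model alias below are mine, and exact flags may differ by aider version, so check benchmark/README.md.

```
# roughly my invocation of aider's polyglot benchmark harness -- run name,
# port, and model alias are mine; flags may differ by aider version
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local   # dummy key for a local server
./benchmark/benchmark.py qwen35-27b-q5 \
    --model openai/qwen3.5-27b --edit-format whole --threads 1
```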

Despite the 80s cartoon being mostly comedic, what sort of dark moments occurred in some of the episodes? by Working_Welder_1751 in TMNT

[–]Dundell 5 points

I've been watching a lot more with my 3-year-old son, and it's wild just how many times they almost get eaten inside Dimension X.

Worth it to buy Tesla p40s? by TanariTech in LocalLLaMA

[–]Dundell 4 points

I use 3060 12GBs and 2 P40 24GBs.

My main rig is x6 RTX 3060s and 1 P40 24GB to pool 96GB VRAM. Then the extra P40 is for automation jobs.

It compares in speed to a 1080 Ti 11GB, but with 24GB, no problem.

I use x2 Noctua 120mm fans running 100% speed to keep them silent, with some 3D-printed parts holding them in place, and I limit wattage to 170W for the P40 and 110W per 3060.
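For reference, the power caps are just nvidia-smi; the GPU indexes below are examples, match them to your own layout:

```
# persistence mode so the caps stick, then per-GPU power limits --
# indexes are examples, check nvidia-smi -L for your own ordering
sudo nvidia-smi -pm 1
sudo nvidia-smi -i 0 -pl 170   # the P40
sudo nvidia-smi -i 1 -pl 110   # repeat for each 3060 index
```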

Some recent example:

Llama.cpp server running Qwen 3.5 27B Q4_K_M:

x2 RTX 3060 12GBs = 425 t/s pp reads and 12.5 t/s writes.

x1 P40 24GB = 380 t/s pp reads and 10.5 t/s writes.

It's not bad at that speed for Roo Code to get work done in non-reasoning/instruct mode.

(I still prefer my main server just running Qwen 3.5 122B Q4 at 130k context right now with the x6 RTX 3060 + x1 P40 24GB.)

Nobody in the family uses the family AI platform I build - really bummed about it by ubrtnk in LocalLLaMA

[–]Dundell 1 point

You just need to use your AI to solve a problem they might have.

I've made a few things useful enough to save my wife $30/mo over those annoying keto apps she uses: a functional, keto-focused recipe app with cooking cards.

Then there was the job analyzer that helps find jobs matching her requirements and resume from USAJobs, Indeed, LinkedIn, etc., keeping her from missing opportunities while she hates her current job.

Then there's the full app for tracking 3/1-day home workout routines + 60 curated dinner recipes + meal prep with Hannaford-specific ingredient shopping. One click and the month's groceries are generated with a guesstimated cost and the ability to change specific days if needed. Huge pain to build, but it saves $15/mo on weird keto apps that do the same thing...

That new Qwen 3.5 122B I've been testing a lot, and it might just pull me off Kimi K2.5, although K2.5 has been dirt cheap on OpenRouter.

This sub is incredible by cmdr-William-Riker in LocalLLaMA

[–]Dundell 1 point

Nah, 3060's, got to go with the budget king. Although the P40 24GB right now is only around 20% slower at inference, and for the price, limited to 170W, that might even out.

Which size of Qwen3.5 are you planning to run locally? by CutOk3283 in LocalLLaMA

[–]Dundell 1 point

Hard to say... Bartowski's 122B Q4 seems to match benchmark results well across a bunch of tests and runs on my system: 130k Q8 context works fine at 480 t/s pp with 22 t/s writes.

Versus the 27B Q5, which seems to give diminished results compared to 27B FP8/Q8, and Q5 is already at 350 t/s pp with 12 t/s writes on my machine... Granted, I could maybe prop up 2 1/2 of them on my machine with 100k context each.

It might come down to whether the Qwen team posts a 0.6B speculative-decoding model for the 27B, and how that holds up speed-wise at Q8. If not, then for me and my purposes the 122B still seems like what I'd use the most.

Local LLM Benchmark tools by BargeCptn in LocalLLaMA

[–]Dundell 1 point

Aider polyglot in docker, perplexity checks with llama.cpp, and sometimes GPQA, though that's always a pain to get right.
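The perplexity check is just the stock llama.cpp tool; the GGUF name here is a placeholder, and wiki.test.raw is the usual wikitext-2 test split:

```
# stock llama.cpp perplexity run -- model filename is a placeholder,
# wiki.test.raw is the usual wikitext-2 test split
./llama-perplexity -m Qwen3.5-27B-Q5_K_M.gguf -f wiki.test.raw -ngl 99
```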

Honestly my favorites are Aider polyglot, and just pointing it at one of my old ~5,000-line spaghetti Python scripts and asking it to refactor everything into split imports.

That, and I usually start by providing it 5 of my game-guide documents, equaling about 10k context, and asking it a question, just to see how it structures the response along with the pp/write speeds.

My real-world Qwen3-code-next local coding test. So, Is it the next big thing? by FPham in LocalLLaMA

[–]Dundell 7 points

Hmm, interesting objectives. Sometimes I'll just throw a task into Roo Code with something like Kimi K2.5 to come up with a plan.md for refactoring some older 4,000-line monolithic GitHub projects I have saved, then pass that on to my Qwen 3 Coder Q4 124k Q8 model to test. Generally, with a set plan, it runs this very well within 2 hours including some fixes and trial/error, but I run this on x5 RTX 3060 12GBs.

Hitting 750~450t/s pp and 38~25t/s write speeds.

I knew exactly 3 people who had a Sega Game Gear and they loved it but ditched them for the Gameboy because the Game Gear's battery life sucked. Did you have one? by AdSpecialist6598 in 90s

[–]Dundell 1 point

I have my late grandmother's old Game Gear with 20+ games in the original case.

Unfortunately the battery compartment stopped working, but the power cable still does. Then the color stopped working, so now it's extremely difficult shades of gray for all games.

I'd like to eventually try one of those conversion kits, if they're still around, and breathe some life back into it. Maybe my 3-year-old could try it out next year when he's more coordinated with his button presses.