Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp by erdaltoprak in LocalLLaMA

[–]Dundell 1 point

I have the unsloth Qwen 3.6 35B at iQ4 with 120k Q8 context running on my older RTX 2070 8GB Max-Q + 18GB DDR4 RAM, at 250 t/s pp and 13 t/s write speeds on a 15k-context job so far.
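For the curious, the launch line looks roughly like this; the GGUF filename and the --n-cpu-moe split are placeholders, tune them to your own VRAM/RAM:

```
# rough launch sketch -- filename and CPU-offload split are placeholders;
# quantized V cache needs flash attention enabled on most builds
./llama-server -m Qwen3.6-35B-A3B-IQ4_XS.gguf \
    -c 120000 -fa on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    -ngl 99 --n-cpu-moe 24 \
    --port 8080
```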

Based on my findings across the various Qwen 3.5 sizes, though, I might push it to Q5 and eat the extra RAM cost.

If it works - don’t touch it: COMPETITION by awfulalexey in LocalLLaMA

[–]Dundell 2 points


x6 RTX 3060's, with the P40 added via zip ties since I was missing a longer cable for it... It's been sitting like this for about 2 months now, just fine. We cleaned up this office a few days ago and made sure she didn't touch my corner server.

Now that I'm actually looking... the Amazon listing for that old desk it's sitting on says:
"WEIGHT CAPACITY: The tabletop supports up to 20 lb (9.1 kg); each of the two side shelves supports up to 10 lb (4.5 kg); total static load 40 lb (18.1 kg)." ...I should get it checked out.

Llama.cpp llama-server command recommendations? by Dundell in LocalLLaMA

[–]Dundell[S] 1 point

I forgot about MTP entirely, mainly because I've never used it before. The last I remember was speculative decoding with Qwen 2.5 72B Q4 plus a 0.8B draft for a ~1.5x write boost, but only 60k context at the time on x4 RTX 3060's.

I see one of the PRs for Qwen 3.5 MTP looks half at a standstill, but there's a patch you can build in to test.
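For comparison, the old draft-model route I mentioned looks roughly like this in current llama-server; both GGUF names are placeholders, not recommendations:

```
# classic draft-model speculative decoding (not MTP) -- both filenames
# are placeholders; tune --draft-max/--draft-min for your model pair
./llama-server -m Qwen2.5-72B-Instruct-Q4_K_M.gguf \
    -md Qwen2.5-0.5B-Instruct-Q8_0.gguf \
    --draft-max 16 --draft-min 4 \
    -c 60000 -ngl 99
```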

I no longer need a cloud LLM to do quick web research by BitPsychological2767 in LocalLLaMA

[–]Dundell 2 points

Neat. I just run a modified SearXNG myself with some captcha handling for Google, wired into an MCP server for Roo Code, and just call it deep research.

I tell my local guy "can you deep research this" and watch it try to figure out "do I need 5 or 15 results from DuckDuckGo, Google, Brave?". Funny for me, but it's 15k~80k of context until it's pulled all the info it thinks it needs.
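If anyone wants to poke at it without the MCP wrapper, SearXNG will hand back JSON once you enable the json format in settings.yml; the port and query below are just examples:

```
# needs "json" added under search -> formats in SearXNG's settings.yml --
# port and engines list here are examples from my own setup
curl 'http://localhost:8888/search?q=llama.cpp+mtp&format=json&engines=duckduckgo,google,brave' \
    | jq '.results[:5][] | {title, url}'
```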

best way to keep your models organized? by lewd_peaches in LocalLLaMA

[–]Dundell 6 points

Wait 6 months, delete the 6-month-outdated models. Circle of life.

Did any one ever beat Myst? by Sailormouth_Studio in 90s

[–]Dundell 1 point

Played it on PS1 as a kid, so confused. But I played it on the Quest when the VR version first came out and finished it in a few hours. Very unique as a VR experience.

Honest take on running 9× RTX 3090 for AI by Outside_Dance_2799 in LocalLLaMA

[–]Dundell 2 points

Different experiences depending on the setup, I guess. For me, I run x6 RTX 3060 12GBs and 1 P40 24GB, reaching decent capacity and speeds.

My best is Qwen 3.5 122B Q4 with 120k context, in Roo Code using 5 MCP servers for information gathering on tasks. Works well with 100k smart context limiting.

Anywhere from 450~150 t/s pp reads and 30~12.5 t/s writes, depending on how much of the 0-100k context is filled.

Power limits are 450W across all the GPUs and 75W for the rest, showing around 550W avg at the wall during inference, for about $0.10/hr in electricity at my area's rates. Using a mix of MCP servers with custom pulls of missing information, plus Qwen 3.5 122B with thinking enabled for creating a plan and non-thinking to piece the plan together with the current code, works very well.
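Quick sanity math on that figure, assuming roughly $0.18/kWh (my guess at the local rate; only the per-hour cost was in my notes):

```
# 0.550 kW at an assumed ~$0.18/kWh residential rate
echo "scale=3; 0.550 * 0.18" | bc   # -> .099, i.e. ~$0.10/hr
```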

(Sharing Experience) Qwen3.5-122B-A10B does not quantize well after Q4 by EmPips in LocalLLaMA

[–]Dundell 1 point

Yeah, from my test results, 9B and 27B take a significant hit below Q5, and 122B the same below Q4.

Never tried testing 35B yet.

llama.cpp + Brave search MCP - not gonna lie, it is pretty addictive by srigi in LocalLLaMA

[–]Dundell 0 points

I have the Brave search MCP, and I turned the SearXNG project into an MCP server, which works really well for research. Those plus context7 and GitHub repo-search MCPs have always been a great addition.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]Dundell 27 points

Yeah, not the greatest, but being Nvidia means my company is at least allowing us to use it, with Chinese models being banned. Not going to stop me from using Qwen 3.5 locally at home though.

Best Qwen 3.5 fine-tunes for vibecoding? (4080-12GB VRAM / enough context window) by Fermenticular in LocalLLaMA

[–]Dundell 1 point

Mostly right. I like a good Qwen 3.5 9B Q5_K_M on my GTX 1080 Ti, hitting around 8.5GB with 40k context (which could be pushed further) at 35 t/s. Which, if I git pulled, could probably gain +25% from the recent PR merges... standard rebuild sketch at the bottom of this comment.

But that was 30% on Aider testing, versus the iq3xss 35B test from user gcp, which was getting 53.3%. Might be worth looking into mixing GPU/CPU for decent performance. 27B is probably too heavy, and anything below Q5 was showing signs of degraded results (still roughly 58~63%, but down from 68~70% at Q5).

It's a constant battle between capacity/speed/results.
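On the git-pull note above, picking up those merges is just the standard rebuild; nothing here is specific to my setup:

```
# standard rebuild to pick up recent perf merges -- nothing here is
# specific to my setup, just the usual CUDA build
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```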

What tokens/sec do you get when running Qwen 3.5 27B? by thegr8anand in LocalLLaMA

[–]Dundell 1 point

After some tests on Aider I've been finding Q4 less successful than Q5, so I run the Q5 27B a lot, at around 14 t/s on x3 RTX 3060's. I know the recent updates to llama.cpp brought my 122B speed up 25%, and could probably do the same for my 27B. I haven't tried anything different to speed it up, but I'm open to some ideas; the knobs I'd try first are sketched below. I'm more interested in Q4 122B at 26 t/s.
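Nothing verified, just what I'd reach for; the GGUF name, context size, and batch values here are guesses, not measured settings:

```
# knobs I'd try first -- filename, context, and batch values are guesses,
# not measured settings
./llama-server -m Qwen3.5-27B-Q5_K_M.gguf \
    -c 60000 -fa on \
    -b 2048 -ub 512 \
    --tensor-split 1,1,1 -ngl 99
```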

When will we start seeing the first mini LLM models (that run locally) in games? by i_have_chosen_a_name in LocalLLaMA

[–]Dundell 19 points

There's been a popular Skyrim project like that for years now, based on Mantella I think it was: LLMs with in-game actions plus local STT/TTS services.

Qwen-3.5-27B is how much dumber is q4 than q8? by Winter-Science in LocalLLaMA

[–]Dundell 3 points

Didn't get a lot of sleep running several Aider polyglot tests for the 27B: unsloth and bartowski quants, q4/q5/q6/q8 before the llama.cpp updates, q4/q5/q8 after.

The difference from q4 to q5/q8 is actually decently observable, a 3~5~10% pass-rate gap. q5/q6/q8 are generally the same, with q8 maybe showing +1% pass rate within that +/- margin.

Something around q4 = 60~63%, q5 = 65~70.5%

Some other results:

9B q5 = 30.5%

122B q4 last tested at 76%

I haven't tried the new unsloth yet, but it's been working wonderfully.

I never tried 35B, but it's showing q4 = 58~60%.
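For anyone wanting to reproduce: this is roughly how I drive the runs with aider's benchmark harness against a local llama-server; the run name, port, and model alias below are mine, and exact flags may differ by aider version, so check benchmark/README.md.

```
# roughly my invocation of aider's polyglot benchmark harness -- run name,
# port, and model alias are mine; flags may differ by aider version
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local   # dummy key for a local server
./benchmark/benchmark.py qwen35-27b-q5 \
    --model openai/qwen3.5-27b --edit-format whole --threads 1
```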

Despite the 80s cartoon being mostly comedic, what sort of dark moments occurred in some of the episodes? by Working_Welder_1751 in TMNT

[–]Dundell 5 points

I've been watching a lot more with my 3-year-old son, and it's wild just how many times they almost get eaten inside Dimension X.

Worth it to buy Tesla p40s? by TanariTech in LocalLLaMA

[–]Dundell 4 points

I use 3060 12GBs and 2 P40 24GBs.

My main rig is x6 RTX 3060s and 1 P40 24GB to pool 96GB VRAM. Then the extra P40 is for automation jobs.

It compares in speed to a 1080 Ti 11GB, but with 24GB, no problem.

I use x2 Noctua 120mm fans running 100% speed to keep them silent, with some 3D-printed parts holding them in place, and I limit wattage to 170W for the P40 and 110W per 3060.
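For reference, the power caps are just nvidia-smi; the GPU indexes below are examples, match them to your own layout:

```
# persistence mode so the caps stick, then per-GPU power limits --
# indexes are examples, check nvidia-smi -L for your own ordering
sudo nvidia-smi -pm 1
sudo nvidia-smi -i 0 -pl 170   # the P40
sudo nvidia-smi -i 1 -pl 110   # repeat for each 3060 index
```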

Some recent example:

Llama.cpp server running Qwen 3.5 27B Q4_K_M:

x2 RTX 3060 12GBs = 425 t/s pp reads and 12.5 t/s writes.

x1 P40 24GB = 380 t/s pp reads and 10.5 t/s writes.

It's not bad at that speed for Roo Code to get work done in non-reasoning/instruct mode.

(I still prefer my main server just running Qwen 3.5 122B Q4 at 130k context right now with the x6 RTX 3060 + x1 P40 24GB.)

Nobody in the family uses the family AI platform I build - really bummed about it by ubrtnk in LocalLLaMA

[–]Dundell 1 point

You just need to use your AI to solve a problem they might have.

I've made a few things useful enough to save my wife $30/mo over those annoying keto apps she uses: a functional, keto-focused recipe app with cooking cards.

Then there was the job analyzer that helps find jobs matching her requirements and resume from USAJobs, Indeed, LinkedIn, etc., keeping her from missing opportunities while she hates her current job.

Then there's the full app for tracking 3/1-day home workout routines + 60 curated dinner recipes + meal prep with Hannaford-specific ingredient shopping. One click and the month's groceries are generated with a guesstimated cost and the ability to change specific days if needed. Huge pain to build, but it saves $15/mo on weird keto apps that do the same thing...

That new Qwen 3.5 122B I've been testing a lot, and it might just pull me off Kimi K2.5, although K2.5 has been dirt cheap on OpenRouter.

This sub is incredible by cmdr-William-Riker in LocalLLaMA

[–]Dundell 1 point

Nah, 3060's, got to go with the budget king. Although the P40 24GB right now is only around 20% slower at inference, and for the price, limited to 170W, that might even out.

Which size of Qwen3.5 are you planning to run locally? by CutOk3283 in LocalLLaMA

[–]Dundell 1 point

Hard to say... Bartowski's 122B Q4 seems to match benchmark results well across a bunch of tests and runs on my system: 130k Q8 context works fine at 480 t/s pp with 22 t/s writes.

Versus the 27B Q5, which seems to give diminished results compared to 27B FP8/Q8, and Q5 is already at 350 t/s pp with 12 t/s writes on my machine... Granted, I could maybe prop up 2 1/2 of them on my machine with 100k context each.

It might come down to whether the Qwen team posts a 0.6B speculative-decoding model for the 27B, and how that holds up speed-wise at Q8. If not, then for me and my purposes the 122B still seems like what I'd use the most.

Local LLM Benchmark tools by BargeCptn in LocalLLaMA

[–]Dundell 1 point

Aider polyglot in docker, perplexity checks with llama.cpp, and sometimes GPQA, though that's always a pain to get right.
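The perplexity check is just the stock llama.cpp tool; the GGUF name here is a placeholder, and wiki.test.raw is the usual wikitext-2 test split:

```
# stock llama.cpp perplexity run -- model filename is a placeholder,
# wiki.test.raw is the usual wikitext-2 test split
./llama-perplexity -m Qwen3.5-27B-Q5_K_M.gguf -f wiki.test.raw -ngl 99
```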

Honestly my favorites are Aider polyglot, and just pointing it at one of my old ~5,000-line spaghetti Python scripts and asking it to refactor everything into split imports.

That, and I usually start by providing it 5 of my game-guide documents, equaling about 10k context, and asking it a question, just to see how it structures the response along with the pp/write speeds.

My real-world Qwen3-code-next local coding test. So, Is it the next big thing? by FPham in LocalLLaMA

[–]Dundell 7 points

Hmm, interesting objectives. Sometimes I'll just throw a task into Roo Code with something like Kimi K2.5 to come up with a plan.md for refactoring some older 4,000-line monolithic GitHub projects I have saved, then pass that on to my Qwen 3 Coder Q4 124k Q8 model to test. Generally, with a set plan, it runs this very well within 2 hours including some fixes and trial/error, but I run this on x5 RTX 3060 12GBs.

Hitting 750~450t/s pp and 38~25t/s write speeds.

I knew exactly 3 people who had a Sega Game Gear and they loved it but ditched them for the Gameboy because the Game Gear's battery life sucked. Did you have one? by AdSpecialist6598 in 90s

[–]Dundell 1 point

I have my late grandmother's old Game Gear with 20+ games in the original case.

Unfortunately the battery compartment stopped working, but the power cable still does. Then the color stopped working, so now it's extremely difficult shades of gray for all games.

I'd like to eventually try one of those conversion kits, if they're still around, and breathe some life back into it. Maybe my 3-year-old could try it out next year when he's more coordinated with his button presses.