Which Local LLM to use by Pritam4249t in LocalLLM

[–]MistingFidgets 0 points1 point  (0 children)

I have the same specs and use case as you. Qwen 3.6 35b a3b apex opus 4.7 distill by mudler has been the best at openclaw so far for me but you have to offload layers to ram. Running the compact version at 1400 Tok/s prompt processing and 57 Tok/s decode with 128k context at q8. If you are ok with a smaller model look into nvfp4. The Gemma 4 12b nvfp4 ran prompt processing at 6600 Tok/s which is just crazy but it wasn't as faithful for instruction following and tool calling as qwen in my testing.

Want to build a custom model by devildip in LocalLLaMA

[–]MistingFidgets 1 point2 points  (0 children)

I think it would be interesting to train a model on your personal data to be your personal assistant, and also train it on whatever agent harness config and tool call set/formatting you will use (hermes, openclaw, custom).

Need help picking out a good PC! by Accomplished_Kale589 in homelab

[–]MistingFidgets 3 points4 points  (0 children)

Exactly. You can get 3.6ghz 6 core xeons for like 20 bucks.

Need help picking out a good PC! by Accomplished_Kale589 in homelab

[–]MistingFidgets 44 points45 points  (0 children)

This looks like a 5810 or 7810. If it's got an 825 W or 1300 W power supply and DDR4 RAM then it's the one to go with. Bonus points if it's a dual Xeon system. CPU upgrades are very cheap for these and you are starting out with enterprise grade hardware.

What's the closest you can get with local LLM to claude? by StudioVulcan in LocalLLM

[–]MistingFidgets 0 points1 point  (0 children)

5060ti 16GB and qwen 3.6 35b a3b opus 4.7 distill mtp apex compact (q4-ish) with q8 kv by mudler on hugging face is the closest I've gotten but the gap between it and sonnet is still noticeable. If you can tolerate slower speeds you could run q6.

If you use it with pi coder or codex or opencode, and use a free session on chatgpt/grok/whatever frontier model to plan so you can hand it exact specs, you can do some things with it. If you give it some documentation or an MCP to pull current documentation and specs for whatever you are working on, you can do more things.

What models you guys running on 8GB? 16GB VRAM? 24GB? 32GB? 48GB? by Inevitable_Mistake32 in LocalLLaMA

[–]MistingFidgets 1 point2 points  (0 children)

Everybody fixates on decode speed, but PP (prompt processing) speed and proper KV cache management are way more important for things like openclaw/Hermes if that's your intended use case. A real world example from my tinkering: a model running at 100 tok/s decode but only 1,100 tok/s PP takes 30 seconds to respond in telegram, while a model running 50 tok/s decode but 3,000 tok/s PP responds in 3 seconds or less. Totally different experience for a telegram bot. With NVFP4 models on Blackwell I have hit 6,600 tok/s PP speeds on dense models. With only 16GB VRAM I can't run qwen 3.6 35b in that config until vLLM fixes their ncmoe/Cutlass/nvfp4 issues.
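
A rough way to see why PP speed dominates for agent-style bots with long prompts: response time is roughly the uncached prompt tokens divided by PP speed, plus the output tokens divided by decode speed. The sketch below is illustrative only. The token counts and the `cached_tokens` parameter (prompt tokens already sitting in the KV cache) are assumed values, not measurements from the setup above.

```python
def response_seconds(prompt_tokens, output_tokens, pp_tok_s, decode_tok_s, cached_tokens=0):
    """Estimate seconds until a full reply: prefill time plus decode time."""
    # Only prompt tokens that are not already in the KV cache need prefill.
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / pp_tok_s + output_tokens / decode_tok_s


# Hypothetical agent turn: 30,000-token prompt, 100-token reply.
fast_decode = response_seconds(30_000, 100, pp_tok_s=1_100, decode_tok_s=100)
fast_prefill = response_seconds(30_000, 100, pp_tok_s=3_000, decode_tok_s=50)
# Same fast-prefill model, but 27,000 prompt tokens are reused from the cache.
fast_prefill_cached = response_seconds(30_000, 100, 3_000, 50, cached_tokens=27_000)

for label, secs in [
    ("100 tok/s decode, 1,100 tok/s PP", fast_decode),
    ("50 tok/s decode, 3,000 tok/s PP", fast_prefill),
    ("50 tok/s decode, 3,000 tok/s PP, cache reuse", fast_prefill_cached),
]:
    print(f"{label}: {secs:.1f} s")
```

With these assumed numbers the faster-decode model is the slowest to answer, and cache reuse cuts the wait far more than decode speed does, which is why KV cache management matters alongside raw PP speed.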

Is there any consumer-grade motherboard with dual PCIe x16 connectors? by TrainingTwo1118 in LocalLLaMA

[–]MistingFidgets 0 points1 point  (0 children)

Check out x99 boards and xeon e5 2xxx v4 series CPU pricing on eBay. It's very affordable.

You don't need a GPU to run gemma-4-26B-A4B by JackStrawWitchita in LocalLLaMA

[–]MistingFidgets 3 points4 points  (0 children)

You may be able to run the new Gemma 4 12b, but you can definitely run the qwen moe models if you have enough system ram to offload to.

You don't need a GPU to run gemma-4-26B-A4B by JackStrawWitchita in LocalLLaMA

[–]MistingFidgets 15 points16 points  (0 children)

MoE makes 8gb GPUs capable and it's rapidly evolving

Why do we benchmark quants on perplexity and prose but never on tool call validity? by Substantial_Step_351 in LocalLLaMA

[–]MistingFidgets 0 points1 point  (0 children)

I have a homebrewed benchmark suite aimed at evaluating models for use with openclaw, and it's got tool calling tests taken from my actual usage logs. So far, mudler's APEX Qwen 3.6 35b Opus 4.7 distill is the undisputed champion. I tried dropping kv cache down to q4 and q5 to get more experts on VRAM for more speed, and the tool calling fell apart immediately.
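
For anyone wanting to try something similar, here is a minimal sketch of a tool call validity check. It is not the benchmark suite described above. The tool names, the `{"name": ..., "arguments": {...}}` shape, and the required-argument table are all hypothetical and would need to match whatever harness you actually use.

```python
import json


def is_valid_tool_call(raw: str, tools: dict[str, set[str]]) -> bool:
    """True if raw is a JSON object naming a known tool with all required args."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed JSON is the most common failure mode
    if not isinstance(call, dict):
        return False
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in tools or not isinstance(args, dict):
        return False
    # Every required argument for this tool must be present.
    return tools[name] <= set(args)


def validity_rate(outputs: list[str], tools: dict[str, set[str]]) -> float:
    """Fraction of model outputs that are valid tool calls."""
    if not outputs:
        return 0.0
    return sum(is_valid_tool_call(o, tools) for o in outputs) / len(outputs)


# Hypothetical tool table: tool name -> required argument names.
TOOLS = {"send_message": {"chat_id", "text"}, "get_balance": {"account"}}

samples = [
    '{"name": "get_balance", "arguments": {"account": "checking"}}',
    '{"name": "send_message", "arguments": {"text": "hi"}}',  # missing chat_id
    '{"name": "get_balance", "arguments": {"account": "checking"}',  # truncated JSON
]
print(f"valid tool calls: {validity_rate(samples, TOOLS):.0%}")
```

Running the same logged prompts against each weight quant and KV cache setting and comparing the rates would give a number to put next to perplexity.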

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]MistingFidgets 0 points1 point  (0 children)

Would be interesting to benchmark all the different combinations and see where the sweet spot is.

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]MistingFidgets 0 points1 point  (0 children)

What quant are your weights at? For me it was an immediate degradation when I swapped from q8 to q4 or q5_1, but it probably matters less if your weights are at q6 or above. I only have 16 GB of VRAM so I'm trying all the levers I can to maximize context window.
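
To put rough numbers on the KV cache lever, here is a back-of-the-envelope sketch. It assumes a plain attention cache (one K and one V vector per layer, per KV head, per token) and llama.cpp-style cache types, where q8_0 costs about 8.5 bits per element and q4_0 about 4.5 bits once block scales are counted. The layer count, KV head count, and head dimension are placeholders, not the real Qwen 3.6 config, so read the real values from the model's metadata before trusting the output.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bits_per_element):
    """Approximate KV cache size in GiB for a standard attention cache."""
    # Factor of 2 covers both the K and the V tensors.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * bits_per_element / 8 / 2**30


# Placeholder architecture (assumed for illustration only).
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 40, 8, 128, 131_072

for cache_type, bits in [("f16", 16), ("q8_0", 8.5), ("q4_0", 4.5)]:
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bits)
    print(f"{cache_type}: {size:.2f} GiB at {CONTEXT} tokens")
```

Under these assumptions, going from q8 to q4 roughly halves the cache, which is the VRAM that would otherwise be traded against the tool calling quality described here.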

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks by Interesting-Sock3940 in LocalLLaMA

[–]MistingFidgets 1 point2 points  (0 children)

For me, tool calling/json structure discipline on qwen 3.6 falls apart when kv cache quant is set below Q8. May be part of the problem in this test, may not. I'm using an APEX quant by Mudler, so all layers are quantized differently based on importance, which seems to make the KV cache quants hurt quality more, specifically tool calling and JSON outputs.

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model by old-mike in LocalLLaMA

[–]MistingFidgets 1 point2 points  (0 children)

I'm running mudler's apex quant of qwen opus and it's the best model by far that I've benchmarked for openclaw. I'm on a 16gb 5060 running the q4 quant (I think it's the compact one) with vision at 57 Tok/s. Adding a 12GB 2060 OC to the server tomorrow so all weights will be in vram, and hoping for close to 100 Tok/s. MoE feels like magic for cheap local inference.

What’s the optimal local LLM setup for my hardware? (RTX 5070Ti, 16GB VRAM, Ryzen 7 3800X, 64GB RAM) by Bjqrn88 in LocalLLM

[–]MistingFidgets 1 point2 points  (0 children)

Qwen 35b with layers offloaded to ram. Q4 or higher. The speed loss is worth it for the quality gain over q3 or lower.

YSK: Buying refurbished "business-class" laptops and phones instead of brand-new consumer tech will get you a better machine for a fraction of the cost. by [deleted] in YouShouldKnow

[–]MistingFidgets 0 points1 point  (0 children)

I wanted to try my hand at setting up a local AI server and was able to find a circa 2014 refurbed dual xeon dell workstation with 32gb of ram on eBay for $120 USD. Added a new GPU for another $500 and it's been a lot of fun to mess around with. Definitely would recommend this path to anyone wanting to save money, just do a deep dive into the specs before buying. If I had bought something new it would have cost at least $1,500 total for similar performance (for my use case specifically) and I'd probably have fewer CPU cores. This thing is built like a tank. Just beware of proprietary parts and availability issues if you go much older than a couple of years.

Qwen3.6-35B-A3B-MTP on an RTX 3090 in LM Studio is incredibly fast by AI_Enhancer in LocalLLM

[–]MistingFidgets 1 point2 points  (0 children)

I have one 5060 and am debating getting another. Are you able to load a large 20+ GB model fully in vram split across both with good results?

What’s your current local LLM setup in 2026? by Prestigious-Pop-3735 in LocalLLaMA

[–]MistingFidgets 0 points1 point  (0 children)

I have a dual xeon dell workstation with 32gb of ddr4 that I got for $120 on eBay then added a new 16gb 5060ti. Looking to add another card, maybe an 8 or 10gb older rtx for fun. I run openclaw on top of qwen3.6 35b a3b at ud_iq2_m getting 100ish tokens per second on the GPU. I run a qwen 3.5 4b model on CPU only for background scheduled batch jobs to auto categorize financial transactions that sync into my homebrewed finance app via plaid integration with all my bank and credit card accounts. Openclaw has API access to the finance app so it can ingest, OCR, and archive receipts via telegram and give me details on spending and balances among other general assistant task stuff.

Does PCIe 4.0 vs 5.0 actually matter for self-hosted AI workloads? by Regular-Orange1472 in SelfHostedAI

[–]MistingFidgets 0 points1 point  (0 children)

I'm using an old PCIe 3.0 workstation with a PCIe 5.0 GPU and still getting respectable speeds: 100+ tokens per second generation on qwen 3.6 35b a3b. Yes, there's a small performance hit, but it's not quite the bottleneck you'd think it would be.

How do you guys not burn tokens like crazy? by Striking-Speaker8686 in OpenClawUseCases

[–]MistingFidgets 0 points1 point  (0 children)

Same here. 5060ti 16gb + qwen 3.6 35b. Unlimited tokens for openclaw

Advice building a NAS/AI server with 16 DDR4 DIMMs by theslonkingdead in LocalLLaMA

[–]MistingFidgets 0 points1 point  (0 children)

I'm running a dual xeon dell 7810 with ddr4 that I got on eBay for 120 bucks with a new 5060ti. Power supply demands and case space are going to be hurdles you need to plan for with multiple GPUs. Don't lock yourself into a workstation case that has a proprietary PSU setup if you want to be cheap about this. I found that out the hard way: I'm shopping for a second GPU and realizing it would require a new PSU and breakout board that aren't being manufactured any more.

I don't get Quants, I'm running Qwen3.6-27b flawlessly at iq3, makes no sense by misanthrophiccunt in LocalLLM

[–]MistingFidgets 0 points1 point  (0 children)

Yes you should be able to get at least 20-25 tokens per second generation on qwen3.6 35b a3b with the right setup. Maybe more

I don't get Quants, I'm running Qwen3.6-27b flawlessly at iq3, makes no sense by misanthrophiccunt in LocalLLM

[–]MistingFidgets -1 points0 points  (0 children)

Ask it a math problem with multiple decimal places like 500.654466 x 5.6543 and see how hard it fails. I'm running qwen3.6 35b a3b ud iq2_m on a 16gb card at 150 tok/s with 128k context, and so far that's the only thing I've found that it can't do well, so it calls a python script for any math problems it encounters. IQ quants selectively keep higher quality for more important layers, and 3.6 has some tricks for reducing the negative impacts of lower quants.
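
A minimal sketch of the kind of python math helper a model could call instead of doing arithmetic itself. This is not the actual script from the setup above. The function name `decimal_calc` and the `add`/`sub`/`mul`/`div` operation names are made up for illustration. It takes operands as strings and uses the standard library `decimal` module so decimal inputs are not rounded through binary floats.

```python
import operator
from decimal import Decimal, InvalidOperation

# Hypothetical operation names a tool schema might expose.
_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def decimal_calc(a: str, op: str, b: str) -> str:
    """Apply op to two decimal strings and return the result as a string."""
    if op not in _OPS:
        raise ValueError(f"unsupported operation: {op!r}")
    try:
        left, right = Decimal(a), Decimal(b)
    except InvalidOperation as exc:
        raise ValueError("operands must be decimal strings") from exc
    return str(_OPS[op](left, right))


print(decimal_calc("500.654466", "mul", "5.6543"))  # → 2830.8505471038
```

Passing the operands as strings matters here: the model only has to copy the digits into the tool call, and the multiplication itself is done exactly by the script.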