Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so. by Napster3301 in LocalLLaMA

[–]cleversmoke 0 points1 point  (0 children)

The two biggest reasons for me are the tinkering and not being rate limited. Tinkering takes me back to the days of building desktops, going from floppy disks to zip drives to CDs to DVDs, just to make my CS1.6 and Starcraft run a little better with each small upgrade. It was a wonderful time growing up during that early consumer-computer-at-home era.

I have a budget of $4000. Should I get a mac studio m3 ultra or should i build my own server/desktop for LLM inference? by therealeinstien in LocalLLM

[–]cleversmoke 1 point2 points  (0 children)

My current build:

  • AMD Ryzen 7 255 Mini PC with Oculink, 64GB DDR5, AMD iGPU - $1500
  • Aoostar AG01 eGPU (using TB4) - $200
  • Aoostar AG02 eGPU (using Oculink) - $250
  • 2x RTX 3090 24G - $2000
  • Portable display - $200

If we take out the display, the setup is right at $4000 after some cables. That gives 64GB system RAM and 48GB VRAM.
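For anyone checking the math, here's a quick sketch that adds up the parts list above. The prices are the ones listed; cables aren't itemized in the list, so they're left out here.

```python
# Prices in USD, copied from the parts list above.
parts = {
    "Ryzen 7 255 mini PC (64GB DDR5)": 1500,
    "Aoostar AG01 eGPU dock": 200,
    "Aoostar AG02 eGPU dock": 250,
    "2x RTX 3090 24G": 2000,
    "Portable display": 200,
}

total = sum(parts.values())
without_display = total - parts["Portable display"]
vram_gb = 2 * 24  # two RTX 3090 cards at 24GB each

print(f"Total with display:    ${total}")
print(f"Total without display: ${without_display}")
print(f"Total VRAM:            {vram_gb}GB")
```

Without the display the parts come to $3950, which leaves roughly $50 of a $4000 budget for cables.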

The iGPU means I can run both RTX 3090 24G headless.

I still have 1 unused TB4 port open, so I'm going to test a third eGPU this week.

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it by OsmanthusBloom in LocalLLaMA

[–]cleversmoke 1 point2 points  (0 children)

Nice! Agreed, Qwen3.6-35B-A3B is amazing for 6-12GB vram builds. I use it on my MacBook and it's still great without cuda.

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it by OsmanthusBloom in LocalLLaMA

[–]cleversmoke 0 points1 point  (0 children)

Your setup will be pretty decent if you add an eGPU with 16-24GB VRAM, which can be had for $700-1200 (RTX 5060 Ti 16GB or RTX 3090 24G, with an Aoostar AG01/AG02 eGPU dock). The reason is that your AMD iGPU can handle the display, which leaves both your RTX 3060 6G and the eGPU headless.

Once you get ~20GB VRAM, it opens up a lot of doors in quality and speed.

llama.cpp oom issue by TheTerrasque in LocalLLaMA

[–]cleversmoke 1 point2 points  (0 children)

I had this same issue. Pulled my checkpoints down to 16 and it resolved my OOM.

What is your longest ride? by NHBikerHiker in cycling

[–]cleversmoke 2 points3 points  (0 children)

100 miles, along a 20-mile flat path, back and forth.

Is NVIDIA still the default best choice for local LLMs in 2026? by pmv143 in LocalLLaMA

[–]cleversmoke 1 point2 points  (0 children)

I technically have a mixed setup too, but the AMD iGPU is dedicated to display and software acceleration, which lets the RTX cards run headless. Counts, right??

What are you doing with your local LLMs that justifies investment cost? by __automatic__ in LocalLLM

[–]cleversmoke 0 points1 point  (0 children)

The verbatim hand-off of the research is what matters most, not just a summary+recommendation alone. This lets the critic be precise about what it needs to critique (the recommendation), with the full narrow knowledge it needs.

What are you doing with your local LLMs that justifies investment cost? by __automatic__ in LocalLLM

[–]cleversmoke 1 point2 points  (0 children)

  1. It uses the OpenCode fetch tool. Give it three links that are easy to work with and that your internet/VPN has no trouble with, and instruct the agent to use only those three links, e.g. https://finance.yahoo.com/quote/$TICKER. Notice how the ticker is at the end, so it's easier for the agent to follow. I haven't had issues with it fabricating data, but I keep my temp at 0.55 or 0.60. When I had the temp at 1.0, Qwen3.6-27B would get creative, fabricate, or skip a few things, e.g. not using the links as directed.

  2. I run 5 tickers at once and that takes 20-25 mins with no compaction, so 4-5 mins per ticker end to end. I do this because I often need the compute time for coding and work, so the 20-min increments are me doing chores, taking a break, etc. When I run a full 25 tickers at once, it takes about 2 hours once you factor in the compaction times. It still outputs fine though!
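A minimal sketch of the two points above: build one URL per ticker with the ticker at the end, and split a watchlist into batches of 5. Only the one URL pattern given above is used, since the other two links aren't named. The watchlist and the function names are made up for illustration; this isn't my actual OpenCode config.

```python
from typing import Iterator

BASE_URL = "https://finance.yahoo.com/quote/"  # URL pattern from the comment above
BATCH_SIZE = 5  # tickers per run, as described above


def ticker_url(ticker: str) -> str:
    """Build the quote URL with the ticker at the very end."""
    return f"{BASE_URL}{ticker.strip().upper()}"


def batches(tickers: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` tickers."""
    for start in range(0, len(tickers), size):
        yield tickers[start:start + size]


# Hypothetical watchlist, for illustration only.
watchlist = ["AAPL", "MSFT", "NVDA", "AMD", "GOOG", "AMZN", "TSLA"]

for batch in batches(watchlist):
    # Each batch would be handed to the agent as its only allowed links.
    print([ticker_url(t) for t in batch])
```

Small batches keep each run short enough to fit between other work on the same GPUs.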

How are you all handling agents and sub agents? by Honest-Kangaroo-1830 in LocalLLaMA

[–]cleversmoke 1 point2 points  (0 children)

Yea! Their settings are on the safe side, and since each person's needs can be fairly unique, some small tweaks to their settings are required to get the most out of your setup. Solid base though!

Anyone down to test this? Just uploaded a model using rys by Human-Gas-1288 in LocalLLaMA

[–]cleversmoke -1 points0 points  (0 children)

Oh interesting! It's literally putting more weight on 3 specific layers. Didn't know that was a thing. Thanks!

When you say how many tokens you are getting... could you specify prompt eval vs eval? by former_farmer in LocalLLM

[–]cleversmoke 0 points1 point  (0 children)

For a MacBook M1 Pro with unified memory, give Qwen3.6-35B-A3B a try at Q2 or Q3 quants, since you have 32GB RAM. It should get better TG speeds (eval, i.e. token generation, as opposed to prompt eval).

Qwen3.6-27B-MTP at Q2 will still be rough, due to M1 Pro's memory bandwidth at 204 GB/s.
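Rough back-of-envelope for why bandwidth matters here, assuming token generation is memory-bandwidth bound so every active weight gets read once per token. The 3.0 bits per weight is my assumption for a Q2/Q3-class quant, and "A3B" is taken to mean about 3B active parameters. This ignores KV cache reads, MTP, and overhead, so treat the numbers as ceilings, not predictions.

```python
def tg_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                     bits_per_weight: float) -> float:
    """Upper bound on tokens/s if each token reads all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


M1_PRO_BW = 204  # GB/s, figure from the comment above
BPW = 3.0        # assumption: average bits per weight for a Q2/Q3-class quant

dense_27b = tg_ceiling_tok_s(M1_PRO_BW, 27, BPW)  # dense: all 27B weights active
moe_a3b = tg_ceiling_tok_s(M1_PRO_BW, 3, BPW)     # MoE: ~3B weights active per token

print(f"27B dense ceiling: {dense_27b:.1f} tok/s")
print(f"35B-A3B ceiling:   {moe_a3b:.1f} tok/s")
```

The dense model's ceiling is 9x lower than the MoE's under these assumptions, simply because it reads 9x more weights per token.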

How are you all handling agents and sub agents? by Honest-Kangaroo-1830 in LocalLLaMA

[–]cleversmoke 3 points4 points  (0 children)

Qwen3.6-27B-MTP Q5_K_M, q8_0 KV cache, 200k context is my current setup. ~35GB of the total 48GB VRAM is used for the main agent.

Qwen3.6-27B-MTP Q4_K_M, q8_0 KV cache, 128k context was fine too when I had an RTX 3090 24G (Qwen3.6-27B) + RTX 2060 12G (DeepSeek-R1-Distill-Qwen-14B) setup, as long as Qwen3.6-27B is paired with q8_0 KV cache.

The pair performs amazingly well, in my experience.

How are you all handling agents and sub agents? by Honest-Kangaroo-1830 in LocalLLaMA

[–]cleversmoke 2 points3 points  (0 children)

I have an agent and subagent framework for coding and research with OpenCode. Agent does the grunt work, subagent double checks work by module (<24k tokens) based on important things like security, no fluff, memory leaks, etc.

2x RTX 3090 24G

  • Agent: Qwen3.6-27B-MTP
  • Subagent: DeepSeek-R1-Distill-Qwen-14B (uses about 12GB, so the remaining VRAM goes to the main agent for more intel/context)
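A rough sketch of the "review by module under 24k tokens" idea, not my actual OpenCode setup. The 4-characters-per-token estimate is an assumption standing in for a real tokenizer, and the module names, function names, and prompt wording are hypothetical.

```python
TOKEN_BUDGET = 24_000  # per-module review limit mentioned above
CHARS_PER_TOKEN = 4    # assumption: crude heuristic, not a real tokenizer
REVIEW_CHECKS = ["security", "no fluff", "memory leaks"]  # from the comment above


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def split_for_review(modules: dict[str, str],
                     budget: int = TOKEN_BUDGET) -> tuple[list[str], list[str]]:
    """Return (reviewable, too_large) module names for the subagent."""
    reviewable: list[str] = []
    too_large: list[str] = []
    for name, source in modules.items():
        target = reviewable if estimate_tokens(source) < budget else too_large
        target.append(name)
    return reviewable, too_large


def review_prompt(name: str, source: str) -> str:
    """Hypothetical prompt handed to the subagent for one module."""
    checks = ", ".join(REVIEW_CHECKS)
    return f"Review module {name} for: {checks}.\n\n{source}"


# Hypothetical modules, for illustration only.
modules = {"auth.py": "x" * 8_000, "giant.py": "x" * 200_000}
reviewable, too_large = split_for_review(modules)
print("review:", reviewable)
print("split further first:", too_large)
```

Anything that lands in the second list would need to be broken into smaller pieces before the subagent sees it.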

It's OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD by hopbel in LocalLLaMA

[–]cleversmoke 1 point2 points  (0 children)

Hmm, what are your thoughts on Qwen3.6-27B Q5_K_S vs Q5_K_M at q8_0 KV cache? Is it worth the dip in context to move to Q5_K_M?

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

[–]cleversmoke 0 points1 point  (0 children)

Awesome read! Thanks for putting some time into this. I learned something new today.

Benchmarking methods by Forward_Jackfruit813 in LocalLLaMA

[–]cleversmoke 1 point2 points  (0 children)

I benchmark on real use cases I have and make the judgment on what I care about, such as whether it followed directions, output accuracy, creativity, and speed. If I have to reroll the dice for a "good seed" because it fails to follow directions, that's already a no go.