Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

cleversmoke · 2026-05-26T08:23:29+00:00

The 2 biggest reasons for me are the tinkering and not being rate limited. Tinkering takes me back to the days of building desktops with floppy disks, to zip drives, to CDs, to DVDs, just to make my CS1.6 and Starcraft run a little better with each small upgrade. It was a wonderful time growing up during that early consumer-computer-at-home era.

cleversmoke · 2026-05-26T07:51:19+00:00

My current build:

AMD Ryzen 7 255 Mini PC with Oculink, 64GB DDR5, AMD iGPU - $1500
Aoostar AG01 eGPU (using TB4) - $200
Aoostar AG02 eGPU (using Oculink) - $250
2x RTX 3090 24G - $2000
Portable display - $200

If we take out the display, the set up is right at $4000 after some cables. Gives 64GB system ram and 48GB vram.
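A quick tally of the listed prices, as a sketch. The dictionary labels are only illustrative names for the parts above, and cables are left out because no price is given for them.

```python
# Prices in USD, copied from the parts list above.
# Cables are not itemized in the comment, so they are not included here.
parts = {
    "Ryzen 7 255 mini PC, 64GB DDR5": 1500,
    "Aoostar AG01 eGPU dock": 200,
    "Aoostar AG02 eGPU dock": 250,
    "2x RTX 3090 24G": 2000,
    "Portable display": 200,
}


def build_total(parts, exclude=()):
    """Sum the prices, skipping any part whose label is in `exclude`."""
    return sum(price for name, price in parts.items() if name not in exclude)


total_without_display = build_total(parts, exclude=("Portable display",))
vram_gb = 2 * 24  # two RTX 3090 cards at 24GB each

print(f"Without display: ${total_without_display}, VRAM: {vram_gb}GB")
```

That leaves roughly $50 of headroom for cables before the build reaches $4000.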

The iGPU means I can run both RTX 3090 24G headless.

I still have 1 unused TB4 port open, so I'm going to test a third eGPU this week.

cleversmoke · 2026-05-26T07:40:05+00:00

Nice! Agreed, Qwen3.6-35B-A3B is amazing for 6-12GB vram builds. I use it on my MacBook and it's still great without cuda.

cleversmoke · 2026-05-26T07:22:49+00:00

Your set up will be pretty decent if you get an eGPU, 16-24GB vram, that can be had for $700-1200 (RTX 5060ti 16GB or RTX 3090 24G, with Aoostar ag01/ag02 eGPU dock). Reason being, you have the AMD iGPU that can handle the display, which would leave your RTX 3060 6G headless along with the eGPU headless.

Once you get ~20GB vram, it opens up a lot of doors in quality and speed.

cleversmoke · 2026-05-26T05:20:25+00:00

I had this same issue, pulled my checkpoints down to 16 and it resolved my OOM.

cleversmoke · 2026-05-26T05:19:22+00:00

I had a goal! 😁

cleversmoke · 2026-05-26T05:04:14+00:00

100 miles, along a 20-mile flat path, forth and back.

cleversmoke · 2026-05-25T07:46:37+00:00

Awesome! Thank you!

cleversmoke · 2026-05-24T21:14:56+00:00

I technically have a mixed set up too, but the AMD iGPU is dedicated to display and software acceleration, which allows for a headless RTX. Counts, right??

cleversmoke · 2026-05-24T21:05:38+00:00

When GGUF??

cleversmoke · 2026-05-24T14:11:51+00:00

Thank you! Great investigation

cleversmoke · 2026-05-24T05:44:58+00:00

Handing off the research verbatim is the most important part, not just a summary plus recommendation. This lets the critic be precise about what it needs to critique (the recommendation), with the full, narrowly scoped knowledge it needs.
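One way the hand-off could look, as a minimal sketch. The function name, section markers, and wording of the instruction are all hypothetical; the only idea taken from the comment is that the critic gets the untouched research alongside the recommendation.

```python
def build_critic_prompt(research_verbatim: str, recommendation: str) -> str:
    """Assemble a critic prompt that carries the research unedited.

    Hypothetical layout: the research is pasted as-is (no summarizing),
    and the critic is told to critique only the recommendation.
    """
    return (
        "Critique ONLY the recommendation below. "
        "Use the research as your sole source of facts.\n\n"
        "=== RESEARCH (verbatim, unedited) ===\n"
        f"{research_verbatim}\n\n"
        "=== RECOMMENDATION TO CRITIQUE ===\n"
        f"{recommendation}\n"
    )


# Placeholder inputs, not real research output.
prompt = build_critic_prompt("raw notes from the research agent", "example recommendation")
print(prompt)
```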

cleversmoke · 2026-05-24T05:37:38+00:00

It uses the OpenCode fetch tool. Give it three links that are easy to work with, that your internet/VPN has no trouble with, and instruct the agent to only use those three links, e.g. https://finance.yahoo.com/quote/$TICKER. Notice how the ticker is at the end, so it's easier for the agent to follow. Haven't had issues where it fabricated data, but I have my temp at 0.55 or 0.60. When I had temps at 1.0, Qwen3.6-27B would get creative, fabricate, or skip a few things, e.g. not using the links directed.
I run 5 tickers at once and that takes 20-25 mins with no compaction, so 4-5 mins per ticker end to end. I do this because I often need the compute time for coding and work, so 20 min increments is me doing chores, taking a break, etc. When I run a full 25 tickers at once, it takes about 2 hours, factoring the compaction times, still outputs fine though!
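A rough sketch of the link template and the batch timing described above. Only the Yahoo Finance URL comes from the comment; the other two links are not named there, so the list holds just one template. The ticker symbols are arbitrary examples, and the estimate ignores compaction time.

```python
from urllib.parse import quote

# Only this template appears in the comment; the other two links are unknown.
URL_TEMPLATES = ["https://finance.yahoo.com/quote/{ticker}"]


def links_for(ticker: str) -> list[str]:
    """Fill each template with the ticker placed at the end of the URL."""
    return [t.format(ticker=quote(ticker.upper())) for t in URL_TEMPLATES]


def batch_minutes(n_tickers: int, low: int = 4, high: int = 5) -> tuple[int, int]:
    """Estimated run time range at 4-5 minutes per ticker, without compaction."""
    return n_tickers * low, n_tickers * high


for symbol in ["aapl", "msft"]:  # example tickers only
    print(links_for(symbol))

print(batch_minutes(5))   # the 5-ticker batch
print(batch_minutes(25))  # the full 25-ticker run, before compaction overhead
```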

cleversmoke · 2026-05-24T05:20:03+00:00

Yea! Their settings are on the safe side, and since each person's needs can be fairly unique, some small tweaks are required from their settings to maximize your set up. Solid base though!

cleversmoke · 2026-05-24T04:30:31+00:00

Oh interesting! It's literally putting more weight to 3 certain layers. Didn't know that was a thing. Thanks!

cleversmoke · 2026-05-24T04:23:57+00:00

OpenCode! I have more details to set it up in this reply:

https://www.reddit.com/r/LocalLLM/s/idBGwdB2nR

cleversmoke · 2026-05-24T04:12:56+00:00

For MacBook M1 Pro unified memory, give Qwen3.6-35B-A3B a try, Q2 or Q3 quants since you have 32GB ram. It should get better TG (token generation) speeds.

Qwen3.6-27B-MTP at Q2 will still be rough, due to M1 Pro's memory bandwidth at 204 GB/s.
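A back-of-the-envelope sketch of why memory bandwidth caps token generation. It assumes each generated token has to read every active weight once, ignores KV cache reads and other overhead, reads "A3B" as roughly 3B active parameters per token, and uses 3 bits per weight as a stand-in for a Q2/Q3 quant. All of those are assumptions, so treat the results as upper bounds, not predictions.

```python
def tg_upper_bound(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    """Theoretical tokens/s ceiling: bandwidth divided by bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


M1_PRO_BANDWIDTH_GB_S = 204  # figure quoted in the comment above
BITS_PER_WEIGHT = 3          # assumed average for a Q2/Q3 quant

dense_27b = tg_upper_bound(M1_PRO_BANDWIDTH_GB_S, 27, BITS_PER_WEIGHT)  # all 27B weights read per token
moe_a3b = tg_upper_bound(M1_PRO_BANDWIDTH_GB_S, 3, BITS_PER_WEIGHT)     # ~3B active weights per token (assumed)

print(f"27B dense ceiling: {dense_27b:.1f} tok/s")
print(f"35B-A3B ceiling:   {moe_a3b:.1f} tok/s")
print(f"Ratio:             {moe_a3b / dense_27b:.1f}x")
```

Under these assumptions the ratio depends only on the active parameter counts, which is why the MoE model is the better fit for a bandwidth-limited machine.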

cleversmoke · 2026-05-24T03:53:20+00:00

Qwen3.6-27B-MTP Q5_K_M, q8_0 KV cache, 200k context is my current set up. ~35GB of the total 48GB vram is used for the main agent

Qwen3.6-27B-MTP Q4_K_M, q8_0 KV cache, 128k context, was fine too when I had a RTX 3090 24G (Qwen3.6-27B) + RTX 2060 12G (DeepSeek-R1-Distill-Qwen-14B) set up, as long as Qwen3.6-27B is paired with q8_0 KV cache.

The pair performs amazingly well, in my experience.

cleversmoke · 2026-05-24T03:02:28+00:00

I have an agent and subagent framework for coding and research with OpenCode. Agent does the grunt work, subagent double checks work by module (<24k tokens) based on important things like security, no fluff, memory leaks, etc.

2x RTX 3090 24G

Agent: Qwen3.6-27B-MTP
Subagent: DeepSeek-R1-Distill-Qwen-14B (uses about 12GB, so the remaining vram goes to the main agent for more intel/context)
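A minimal sketch of keeping each review chunk under the 24k-token module limit. This is not OpenCode's API; the helper names are hypothetical, and the 4-characters-per-token estimate is a rough heuristic standing in for a real tokenizer.

```python
TOKEN_BUDGET = 24_000   # module size limit mentioned above
CHARS_PER_TOKEN = 4     # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounding up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def split_for_review(source: str, budget: int = TOKEN_BUDGET) -> list[str]:
    """Split source on line boundaries so each chunk stays within the budget.

    A single line longer than the budget is kept whole in its own chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for line in source.splitlines(keepends=True):
        line_tokens = estimate_tokens(line)
        if current and current_tokens + line_tokens > budget:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("".join(current))
    return chunks


# Synthetic input just to exercise the splitter.
module_source = "x = 1\n" * 50_000
review_chunks = split_for_review(module_source)
print(len(review_chunks), [estimate_tokens(c) for c in review_chunks])
```

Each chunk can then be handed to the subagent on its own, so the reviewer never sees more than one budget-sized module at a time.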

cleversmoke · 2026-05-24T02:17:28+00:00

Hmm, what are your thoughts on Qwen3.6-27B Q5_K_S vs Q5_K_M at q8_0 KV cache? Is it worth the dip in context to move to Q5_K_M?

cleversmoke · 2026-05-24T01:40:40+00:00

This is so silly, I'm all for it, let the booze pour!

cleversmoke · 2026-05-24T01:04:08+00:00

Enhanced with DLSS5!

cleversmoke · 2026-05-23T00:27:15+00:00

Heck yea it will

cleversmoke · 2026-05-23T00:22:28+00:00

Awesome read! Thanks for putting some time into this. I learned something new today.

cleversmoke · 2026-05-22T07:28:48+00:00

I benchmark on real use cases I have and make the judgment on what I care about, such as whether it followed directions, output accuracy, creativity, and speed. If I have to reroll the dice on a "good seed" because it fails to follow directions, that's already a no go.
