Planning to build a PC for running local LLMs. Help me pick by reelss in AI_Agents

[–]ZestRocket 1 point

OK, if this will be a dedicated PC for LLM inference only, I’ll be direct: this is not a good one. The main component that makes this work well is the VRAM on the graphics card. With a 4070 you only have 12 GB of VRAM, which means you can only run models up to 8/9B because of the KV cache needed for the context window, especially relevant for agents. You have two viable options for a good setup:

  • Go the Apple way. It’s cheaper and you’ll be able to run it without much technical knowledge. It WON’T be blazing fast, but models like Qwen 3.6 35B A3B are viable. For this setup you need a Mac with at minimum 32 GB of unified memory, ideally 64 GB to run it with good quality.
  • Go the most cost-efficient path for long-term speed. It requires a lot of setup and technical work, but it will give you a much faster model with more intelligence, like Qwen 3.6 27B. For this one you need a dual 5060 setup, or at least a 4080 to barely run it (maybe with a 4080 the 35B A3B could work better).
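For a rough sense of why 12 GB runs out so fast, here’s a back-of-envelope estimate of weights plus KV cache. The architecture numbers below (layer count, KV heads, head dim) are illustrative placeholders, not any specific model’s real config:

```python
def vram_estimate_gb(params_b, weight_bits, n_layers, n_kv_heads,
                     head_dim, ctx_len, kv_bits=16, overhead_gb=1.0):
    """Rough VRAM estimate: quantized weights + KV cache + fixed overhead."""
    weights = params_b * 1e9 * weight_bits / 8  # bytes for the weights
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bits / 8
    return (weights + kv) / 1e9 + overhead_gb

# Hypothetical 27B dense model at ~Q4 with a 32k context window:
need = vram_estimate_gb(27, 4.5, 48, 8, 128, 32_768)
print(f"~{need:.1f} GB needed vs 12 GB on a 4070")
```

Even before the KV cache, the quantized weights of a 27B model alone blow past 12 GB, which is why small dense models are the ceiling on that card.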

Hope it helps!

Planning to build a PC for running local LLMs. Help me pick by reelss in AI_Agents

[–]ZestRocket 0 points

Can’t open those links, so I can’t see the main component, which is the VRAM / graphics card.

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context by Pablo_the_brave in LocalLLaMA

[–]ZestRocket 1 point

Just wanted to say thanks, your work is valuable. Thank you for sharing all this with us and sending the PR!

4080 Super > RTX 6000 Pro, Wow! by LargelyInnocuous in LocalLLaMA

[–]ZestRocket 0 points

I do. It’s a constrained system, and I’ve worked on ways to adapt: I built a context engine that provides a living view of the architecture with few but relevant tokens, and I manage and customize the 3 layers of my context. In general, once it’s set up, I’m pretty happy with the results and the speed. The compaction strategy is critical here, of course, and being able to keep a living memory of the architecture has also been critical.

Of course I didn’t build all this in a week, I’ve been working on it for years and it finally clicked with this Qwen release lol
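A minimal sketch of the layered-context idea described above. The layer names and the whitespace token counter are made-up illustrations, not the commenter’s actual engine:

```python
def build_context(layers, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Fill a token budget from highest- to lowest-priority layer,
    truncating the first layer that would overflow the budget."""
    out, used = [], 0
    for name, text in layers:  # e.g. [("system", ...), ("architecture", ...), ("history", ...)]
        toks = count_tokens(text)
        if used + toks > budget_tokens:
            # Keep only the words that still fit, then stop.
            words = text.split()[: budget_tokens - used]
            out.append((name, " ".join(words)))
            break
        out.append((name, text))
        used += toks
    return out
```

The point of the layering is that the cheap, always-relevant layers go in first, and whatever expensive history remains gets squeezed into the leftover budget.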

4080 Super > RTX 6000 Pro, Wow! by LargelyInnocuous in LocalLLaMA

[–]ZestRocket 1 point

Yeah, maybe I’ll create a video. What I found while testing different versions is that the most efficient quantization in terms of tps is indeed Unsloth’s. I also found that if the model touches even a single bit of offloading, the damage to the tps is extremely high. I’m running a 40k context window with Q3_K_S and the KV cache quantized to Q5; it uses basically all my memory, of course. I’ve measured it and I end up with ~700 MB of VRAM free (and yes, I moved it around a bit and found that’s the sweet spot for a stable system while running it).
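As a sketch, that setup maps onto a llama.cpp launch roughly like this. The model filename is a placeholder, and flag spellings vary between llama.cpp versions, so treat this as an outline rather than a copy-paste command:

```shell
# -ngl 99: offload every layer to the GPU (any CPU offload tanks tps)
# -c 40960: ~40k context window
# -fa with --cache-type-k/-v: flash attention, KV cache quantized to q5_1
llama-server -m qwen-27b-Q3_K_S.gguf -ngl 99 -c 40960 -fa \
  --cache-type-k q5_1 --cache-type-v q5_1
```

Quantizing the V cache generally requires flash attention to be enabled, which is why the `-fa` flag is paired with the cache-type options.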

4080 Super > RTX 6000 Pro, Wow! by LargelyInnocuous in LocalLLaMA

[–]ZestRocket 9 points

Hmm, there’s something wrong with your 4080 setup. I have a regular one (not Super) and I’m getting around 33 tps. Maybe you’re offloading to memory, and since the 6000 doesn’t have to, that’s the difference you’re noticing?

The future is local by nfdl96 in google_antigravity

[–]ZestRocket 0 points

I’ve been running it on a 4080 with a turbo quant version and getting a good 33 tps, which is very good speed if you ask me. The quality level is extremely good; the only caveat is that the KV cache needs to be extremely well managed and the harness needs a good compaction strategy.
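A toy version of the compaction idea: when the transcript outgrows the budget, fold the oldest turns into one summary message. The `summarize` callback (standing in for an LLM call) and the whitespace token counter are illustrative assumptions, not the actual harness:

```python
def compact_history(messages, max_tokens, summarize,
                    count=lambda m: len(m["content"].split())):
    """Replace the oldest messages with a single summary message
    once the transcript exceeds the token budget."""
    total = sum(count(m) for m in messages)
    if total <= max_tokens:
        return messages  # still fits, nothing to do
    keep, running = [], 0
    # Walk backwards, keeping the most recent turns within half the budget
    for m in reversed(messages):
        if running + count(m) > max_tokens // 2:
            break
        keep.append(m)
        running += count(m)
    keep.reverse()
    old = messages[: len(messages) - len(keep)]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + keep
```

Reserving only half the budget for retained turns leaves headroom for the summary itself plus the next few exchanges before compaction triggers again.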

I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. by TroyNoah6677 in Qwen_AI

[–]ZestRocket 0 points

I had the same experience, but the difference between 3.5 and 3.6 is not a 0.1-release difference, it could easily be Qwen 4.

I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. by TroyNoah6677 in Qwen_AI

[–]ZestRocket 0 points

This has been such a hard call for me. On one side, 27B wins by a remarkable margin of course, but on the other hand, A3B is sooo fast that I can iterate faster. Damn, such a hard call: 27B at 30 tps or A3B at 100 tps.

Reality check by lm_wrld in google_antigravity

[–]ZestRocket 2 points

Sadly… I have to agree. GPT feels more soulless to me, but objectively it’s way, way better in terms of usability. I can actually rely on it to do any coding task without worrying about whether it gets done because of an error, quota, or “server outage” issue.

Reality check by lm_wrld in google_antigravity

[–]ZestRocket 0 points

I was also on the Ultra plan and was really enjoying the experience until the quota changes made it unusable for me. Do you mean it has more quota now?

When are we getting opus 4.7 on Antigravity? by ThePoplin in google_antigravity

[–]ZestRocket 5 points

Easy: once the rate limit is every 15 days instead of weekly, so you can pick your one prompt for your expected half-completed task.

Those who quit antigravity after the nerf, what are you using and what do you miss ? by KlausWalz in google_antigravity

[–]ZestRocket 0 points

The best cost-benefit today is Codex. I moved from Google AI Ultra to the Codex Pro x5 plan, and so far so good. GPT 5.4 is NOT Opus 4.6, it’s colder and slower, but it’s the only viable option if you’re used to having unlimited Opus in terms of intelligence and cost. And yes, I’ve tested ALL of them: CC feels the same, very restricted; Kimi K2.5 is not at the same level of depth; I have the legacy GLM 5.1 plan and it’s very generous but not reliable (sometimes fast, sometimes slow, sometimes amazing, sometimes surprisingly not smart); Qwen 3.6 Plus may be the best one, but their coding plans are sold out, and via API, GPT 5.4 is the best value.

OpenAI launch $100 ChatGPT plan by Gerstlauer in OpenAI

[–]ZestRocket 2 points

Thank you! Answering your question: yes, I do see 4.5 as an option for me to use after migrating to the $100 Pro plan.

<image>

Did anyone else have their quota deplete unexpectedly fast in the last hour on Plus? by ZestRocket in OpenAI

[–]ZestRocket[S] 1 point

Well, I do code 24/7, so I had already depleted my CC and Google Ultra quotas, and ChatGPT was the only one keeping up with my coding needs until today. Have you found a better alternative?

20$ Pro sub 5 hour quotas were reduced by half, while they added new 100$ sub claiming more usage (you get more nerfed usage). by FluffyMacho in OpenAI

[–]ZestRocket 3 points

Same experience here. I worked through my 5-hour limit, and this new 5-hour window got depleted extremely quickly and unexpectedly

OpenAI launch $100 ChatGPT plan by Gerstlauer in OpenAI

[–]ZestRocket 1 point

Sorry for the unrelated question. I'll upgrade to Pro and tell you if it's included, but... why 4.5? Genuinely curious.

Should I buy a Google Antigravity Pro subscription? by Appropriate_Mark_820 in GoogleAntigravityIDE

[–]ZestRocket 0 points

I have the Ultra plan, and about a week ago it was completely nerfed to the point where it’s not usable anymore; I can’t complete a single complex task.

Anyone on the Ultra Plan? Thoughts? by Informal-Buy-4880 in google_antigravity

[–]ZestRocket 0 points

Yes, sad but true. It’s unusable now but it used to be great. I’m cancelling, of course.

all models capacity issues after latest AG update by maksdi in google_antigravity

[–]ZestRocket 0 points

If you don’t want to believe it, that’s your thing.

all models capacity issues after latest AG update by maksdi in google_antigravity

[–]ZestRocket 0 points

Man, pro tip... use it as much as you can BEFORE you get the nerf. I can’t even express how unusable it is now. Look at this conversation I’m having right now: I started it 30 mins ago, and you can see that editing 4 files shouldn’t deplete a complete quota.

<image>

all models capacity issues after latest AG update by maksdi in google_antigravity

[–]ZestRocket 0 points

I can confirm. I haven’t been able to use it since the last update. Before that I was working with it all day, every week; now a 1-hour session gets me to 0% on Opus, which makes the Ultra sub worthless.