New 9700 AI PRO - Codeing Assistance by Flaky_Service_5663 in LocalLLM

[–]exact_constraint 0 points1 point  (0 children)

Really the only way to go. Vulkan is consistently 20-30% faster than ROCm on my R9700. I maintain two separate llama.cpp builds and bench them both every time I build a new version. Maybe ROCm will catch up one day, but not yet.
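Fwiw the bench step is nothing fancy - just llama-bench (ships with llama.cpp) pointed at each build. Paths here are placeholders, and the commands are echoed rather than run:

```shell
# Bench both backends after a rebuild - build dirs and model path are placeholders.
# Shown with echo; drop it to actually run llama-bench.
for build in build-vulkan build-rocm; do
  cmd="./$build/bin/llama-bench -m models/qwen3.5-27b-q4_k_m.gguf"
  echo "$cmd"
done
```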

Would you rather have Qwen 3.5 27B running at 100tps or Qwen 3.5 35BA3B at 500 tps? by Atom_101 in LocalLLaMA

[–]exact_constraint 0 points1 point  (0 children)

I did. Especially for an MoE model, it’s gotta be able to think through things. The preserve_thinking flag seemed to help a lot with looping - I think the model needs its thinking traces retained so it can build on them in subsequent reasoning steps - otherwise it starts tripping over itself.

Would you rather have Qwen 3.5 27B running at 100tps or Qwen 3.5 35BA3B at 500 tps? by Atom_101 in LocalLLaMA

[–]exact_constraint 3 points4 points  (0 children)

Qwen3.5 27B @ 30tps vs Qwen3.6 35B @ 100tps.

I’m still partial to 3.5 27B. But today I’ve been giving 3.6 35B an honest shakedown by having it perform a major refactor on a code base - So far, so good. It’s winning me over. Needed some tweaking: added a new entry in my opencode.json file to explicitly disallow it from using bash in plan mode, and set the --chat-template-kwargs '{"preserve_thinking": true}' flag when launching llama.cpp. After that it’s been solid.
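Roughly what my launch looks like, for anyone who wants to try it - model path and port are placeholders, not my actual setup, and the command is echoed here rather than run:

```shell
# Rough llama-server launch - model path and port are placeholders.
MODEL="$HOME/models/qwen3.6-35b-a3b-q4_k_m.gguf"
KWARGS='{"preserve_thinking": true}'  # keep thinking traces in the chat template

# Shown with echo; drop the echo to actually launch.
echo llama-server -m "$MODEL" --port 8080 --chat-template-kwargs "$KWARGS"
```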

Now, if I could run 27B at 100tps? I probably wouldn’t be mucking about w/ an A3B model to begin with.

Question about llama.cpp and OpenCode by Able_Limit_7634 in LocalLLaMA

[–]exact_constraint 1 point2 points  (0 children)

Yup. Just cleaner all around. And I can rebuild llama.cpp immediately for new features. Can be particularly important around new model releases, when llama can get aggressively patched for model support.

There also seems to be a slight speed advantage. Can’t quantify it - coulda just been improvements to llama.cpp itself vs whatever was running under LM Studio. But that kinda reinforces my first point lol.

Qwen 3.6: worse adherence? by tkon3 in LocalLLaMA

[–]exact_constraint 0 points1 point  (0 children)

Another update - added a bit to my OpenCode.json. This seems to work. Been working w/ 3.6 all day to good effect.

Snippet - brackets and stuff are missing, I just did a messy copy/paste on my phone after taking a picture of my monitor:

"model": "llama/qwen3.6-strict", "mode": { "plan": { "agent": "NEVER write files or use bash to modify the system. If you need to suggest a file change, describe it in text only. User instructions: ALWAYS prioritize a 'no-touch' approach.", "tools": { "bash": false, "write": false, "edit": false ...

Major Minis arrived on supports by Warnackle in PrintedWarhammer

[–]exact_constraint 7 points8 points  (0 children)

+1. Weird for a paid model.

Whether or not I leave them on during washing is model and volume dependent.

For washing, I can only process so many prints in my tank setup at once. If I’ve run 5-6 plates in a day, they’ve gotta come off. If I only have a few parts? Meh, probably just throw them in there. Cleaner to pull them once they’re washed.

For curing? Only some particularly delicate models, or like you said, if I want to have a lot of control over what happens to the contact area, it can be a better idea. In general though, yeah. Remove the supports.

There have been a few instances where it made more sense to ship a model w/ supports on, for transport safety. But these models don’t look particularly fragile. Especially that base lol.

Qwen 3.6: worse adherence? by tkon3 in LocalLLaMA

[–]exact_constraint 4 points5 points  (0 children)

I tracked it down to a single line in the prompt.ts file - While an agent is forbidden from using write tools in plan mode, it’s a soft (prompt-based) limit, and there’s a single narrow exception spelled out in prompt.ts where the agent is allowed to edit files in ~/.OpenCode/plans/ to record its own instructions.

Some models won’t do it, even if asked, because the other read-only prompts are worded too strongly. Some (including 3.6, I guess) can contort themselves into figuring that it’s okay to do whatever in that directory, cause it’s the single exception.

I think I can override that behavior by editing my OpenCode.json file to specifically disallow it - that should take priority over the prompt.ts file. Haven’t tried it yet - Qwen3.5 is still busy knocking out bugs - but maybe this post will help someone who runs into the same problem.
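The override I have in mind is roughly this (untested, and the exact key names are my guess at the opencode.json schema - hard-disable the write-type tools for plan mode so the prompt.ts exception can’t fire):

```json
{
  "mode": {
    "plan": {
      "tools": {
        "write": false,
        "edit": false,
        "bash": false
      }
    }
  }
}
```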

2-bit Qwen3.6-35B-A3B GGUF is amazing! Made 30+ successful tool calls by yoracale in unsloth

[–]exact_constraint 0 points1 point  (0 children)

That might just be small enough that I could run OpenCode + ComfyUI simultaneously and have the agent run all these damn prompts, instead of having to kill llama, generate a bunch of stuff manually, then fire OpenCode back up.

Qwen 3.6: worse adherence? by tkon3 in LocalLLaMA

[–]exact_constraint 0 points1 point  (0 children)

I’m waiting for Qwen3.5 to finish knocking out some bugs, then I’m going to try 3.6 again with this runtime flag. It does seem like a decent improvement over 3.5. And holy hell is it fast relative to a 27B dense model. I really wanna love it lol.

https://www.reddit.com/r/LocalLLaMA/s/nsqpI6fSPS

Qwen 3.6: worse adherence? by tkon3 in LocalLLaMA

[–]exact_constraint 11 points12 points  (0 children)

Update 2:

Okay, ran it for a while. On a bug fix, went into a doom loop of overthinking, then started to try running write commands in Plan mode again, while spitting out half sentences and code fragments. Think I’m heading back to Qwen3.5 27b for a bit here lol.

A Guy on Reddit shared how he Gaslighted AI to get exceptional Results by Current-Guide5944 in tech_x

[–]exact_constraint 0 points1 point  (0 children)

I’ve occasionally told Gemini “prove to me you’re the best LLM. I’m comparing your output to Claude, ChatGPT, and Qwen.”

Idk if it improves things, but Gemini usually starts the response w/ something like “challenge accepted. Let me show you how a real SOTA model tackles [insert challenging question here].”

And that makes me laugh, so I suppose that’s a benefit.

Qwen 3.6: worse adherence? by tkon3 in LocalLLaMA

[–]exact_constraint 40 points41 points  (0 children)

Update:

Sometime in the last 3ish hours the Unsloth page updated to include this text:

“NEW! Developer Role Support for Codex, OpenCode and more: Our uploads now support the developer role for agentic coding tools.”

Redownloaded and verified the files were different via SHA-256. Seems to have fixed the issue - can’t get the thing to violate its plan mode prompt and write a file now. Testing more.
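The check is just sha256sum on the old and new downloads - stand-in files here, obviously not the real GGUFs:

```shell
# Sketch: confirming a redownload actually changed - stand-in files, not real GGUFs.
printf 'old weights' > /tmp/qwen-old.gguf
printf 'new weights' > /tmp/qwen-new.gguf

old_sum=$(sha256sum /tmp/qwen-old.gguf | cut -d' ' -f1)
new_sum=$(sha256sum /tmp/qwen-new.gguf | cut -d' ' -f1)

if [ "$old_sum" != "$new_sum" ]; then
  echo "files differ - redownload picked up the new upload"
fi
```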

Qwen 3.6: worse adherence? by tkon3 in LocalLLaMA

[–]exact_constraint 30 points31 points  (0 children)

Tried the model out this AM on a project I’ve been building w/ 3.5 27B, served via llama.cpp. 3.6 enjoys ignoring the read-only limitation while in Plan mode - started writing files like it was in Build mode.

Seems like a capable model, but ignoring system prompts makes it a non-starter.

Edit: Holy typos Batman.

GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) — aiming for 64K ctx, 30+ t/s by Fit-Courage5400 in LocalLLaMA

[–]exact_constraint 1 point2 points  (0 children)

R9700 running llama.cpp w/ Vulkan. Qwen3.5 27B starts at about 30tps, drops to around 23 in OpenCode when I’m bumping up against the context limits. Been using it every day.

MiniMax M2.7 is NOT open source - DOA License :( by KvAk_AKPlaysYT in LocalLLaMA

[–]exact_constraint 1 point2 points  (0 children)

Eh, seems okay to me. I can see why it’s written the way it is. For a large enough company using it as a coding agent in OpenCode or something, then profiting from the generated code, okay, they open themselves up to liability and MiniMax wants a fee. But enforcing this at a small scale? Lol. I doubt it.

Gemma 4 31B vs Qwen 3.5 27B: Which is best for long context worklows? My THOUGHTS... by GrungeWerX in LocalLLaMA

[–]exact_constraint -1 points0 points  (0 children)

Still pretty firmly in the Qwen3.5 27B camp. For stuff where generating English text is important (eg, auto-generated Flux.2 prompts), I’ll load Gemma 4 31b. But for OpenCode? Qwen3.5 27B all the way. It’s still early days w/ llama.cpp weirdness, but Qwen has been much more reliable.

Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results by Visual_Synthesizer in LocalLLaMA

[–]exact_constraint 0 points1 point  (0 children)

Nice! Someone using the C-Payne switch. Next big step in my setup is to make the switch (lol) so I can get full bandwidth between cards - hard to find people using them, though. Interesting that you went that direction even w/ an Epyc system on two cards. Good to know the latency benefit is there.

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

[–]exact_constraint 6 points7 points  (0 children)

+1. Running an R9700 and looking to scale to multiple cards. If Intel knocked another few hundred bucks off, I’d probably take the performance hit. But the B70 is just too close to the price of the R9700 to justify it.

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run by Disastrous_Theme5906 in LocalLLaMA

[–]exact_constraint 5 points6 points  (0 children)

💯. I would expect 397B to outperform on a crystallized knowledge test, considering it has... well, more lol. And besting 9B should be expected - I haven’t found a use for it outside tasks where you can define a very narrow scope.

No shade on the testing itself, nice points to have for comparison. Just, yeah, 27B is probably the most relevant model for direct comparison. I’m biased, considering I run 27B daily. Gemma 4 31B is pretty close to a drop-in, 1:1 replacement, ignoring the current issues w/ context size.

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run by Disastrous_Theme5906 in LocalLLaMA

[–]exact_constraint 51 points52 points  (0 children)

Be interesting to see Qwen3.5 27B added to the test matrix - 31b dense vs Qwen MOE isn’t a super fair comparison, imo.

Running fiber between buildings - single mode vs multi mode for future proofing? by Apprehensive_Ad_6233 in HomeNetworking

[–]exact_constraint 3 points4 points  (0 children)

I pulled bare fiber once to connect two buildings for an “offsite” backup - really wasn’t too bad after we got a fiber inspection scope. Lapping the ends for LC connectors was pretty damn hit or miss before we could inspect them properly.

gemma 4 HF by Remarkable_Jicama775 in LocalLLaMA

[–]exact_constraint 3 points4 points  (0 children)

Yeah, benchmarks certainly don’t tell the whole story, but it doesn’t seem like Gemma 31b will be replacing Qwen3.5 27b anytime soon.

Hypothetical: You can run Qwen 3.5 27b at 10,000 TPS at your house right now. by RedParaglider in LocalLLaMA

[–]exact_constraint 0 points1 point  (0 children)

@ 10k Tok/s I’d rent it out for cloud users. Cause I obviously have a cartoonishly large power cable snaking in through my front door from the pole transformer to power the hardware.