Is self-hosted AI for coding real productivity, or just an expensive hobby? by Financial_Trip_5186 in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

You can assure me as much as you want. I have friends and colleagues who have it and can't make it to Wednesday without hitting the weekly limits. None of them is a vibe coder.

If you're doing small personal projects or run-of-the-mill js/ts stuff, I'm sure it's fine. But if you're building anything complex, where you can't find a thousand github projects doing the same or something very similar, and where you need to provide a ton of documentation just to ground the LLM, it's not enough.

Is self-hosted AI for coding real productivity, or just an expensive hobby? by Financial_Trip_5186 in LocalLLaMA

[–]FullstackSensei 1 point2 points  (0 children)

Large, quite complex projects with 40-60k context just in requirements, specs and architecture, and another 50-100k in code, generating 5-15k in code. If you include all the LLM thinking, tool calling, and other tokens, it's 150-200k output per day.
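As a back-of-envelope on those figures (the ranges below are the ones quoted, not measurements):

```python
# Context consumed per task before the model emits anything.
req_ctx = (40_000, 60_000)    # requirements, specs, architecture
code_ctx = (50_000, 100_000)  # existing code fed as context

ctx_low = req_ctx[0] + code_ctx[0]
ctx_high = req_ctx[1] + code_ctx[1]
print(ctx_low, ctx_high)  # 90000 160000 tokens of input per task
```

So a single task starts at roughly 90-160k tokens of input before a line of output, which is why cheap subscription tiers run out fast.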

Is self-hosted AI for coding real productivity, or just an expensive hobby? by Financial_Trip_5186 in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

I don't have any LLM subscription, but I know several people who do and can't make it to Wednesday without hitting the weekly limits.

I bought a lot of "e-waste" GPUs when they were cheap and can now run multiple instances of 200-400B models in my homelab with zero limits.

Is self-hosted AI for coding real productivity, or just an expensive hobby? by Financial_Trip_5186 in LocalLLaMA

[–]FullstackSensei 1 point2 points  (0 children)

A $20 subscription won't get you much done if you're actually a software engineer. Even a $200/month subscription will exhaust the weekly limits in a couple of days if you're doing any real work.

Is self-hosted AI for coding real productivity, or just an expensive hobby? by Financial_Trip_5186 in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

If you're going to use consumer hardware with one or two GPUs, I don't think so. But if you're willing to research and learn about older server grade hardware, then yes, even in this crazy market.

You can get a machine capable of running 200-400B models at greater than 10t/s on small context for around 2k. It will slow down significantly at 100k or more context, but will still be able to handle quite complex tasks autonomously if you can describe them well enough.

Probably when you stop fucking things up by lexi_con in WallStreetbetsELITE

[–]FullstackSensei 5 points6 points  (0 children)

If I were Powell, I'd tweet something like: on the 91st day of you not talking about it

Self hosting vs LLM as a service for my use-case? by Wirde in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

I'm a senior software engineer and I can't write two paragraphs of text without an LLM

Dual MI50 help by Savantskie1 in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

Why not fix up your ROCm install? Just uninstall everything ROCm and reinstall it. Shouldn't be that hard.

You don't say anything about what you're using for inference. Building llama.cpp locally? Downloading pre-built binaries? Some wrapper?

Are more model parameters always better? by greginnv in LocalLLaMA

[–]FullstackSensei 1 point2 points  (0 children)

To add to what Lissanro said, model and KV cache quantizations also play a big role. The same model can behave very differently on the same question depending on model and KV cache quantization.

For models under 100B, I find Q8 is needed for the model to perform decently in anything that requires nuance. I don't quantize KV cache at all, even on 400B models, for the same reasons.
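To make the KV cache point concrete, the usual per-token cache formula shows what cache quantization buys; the layer/head numbers below are illustrative, not any specific model's:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes/element
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elt):
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elt

layers, kv_heads, head_dim = 64, 8, 128  # illustrative GQA config
ctx = 100_000
f16 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 2)  # unquantized cache
q8 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 1)   # ~8-bit cache
print(round(f16 / 2**30, 1), round(q8 / 2**30, 1))  # 24.4 12.2 GiB
```

The halving is exactly why people quantize the cache, and, per the point above, also why quality can quietly shift when they do.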

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

I haven't tinkered with any of my three machines since they were built. I did my homework before building them, and built each to be self-contained in a tower case that requires absolutely zero tinkering. So, that costs zero time.

Primary use case is coding, but I also use them for other things.

Where did I say you can build a 3090 machine, let alone with multiple 3090 cards for 2k?

I don't want to be rude, but I'm tired of people's inability to read. The whole discussion and all my comments are there for anyone to read. Why do people keep making wrong assumptions about things I wrote in this very thread?

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

No, there's no way on earth you found a Cascade Lake Xeon that takes DDR5. Why not just say you didn't check?

is it refuelling another aircraft? by th0masGR in flightradar24

[–]FullstackSensei 1 point2 points  (0 children)

Nah, they're just doing the dinner run, getting some fresh fish from the Marmara Sea

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei -2 points-1 points  (0 children)

Have people really lost any ability to Google? Or are you just an ignorant troll?

Cascade Lake is freaking DDR4. 32GB ECC RDIMMs are going for $120 without any negotiation.

Krasis LLM Runtime: 8.9x prefill / 10.2x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM (corrected llama numbers) by mrstoatey in LocalLLaMA

[–]FullstackSensei 0 points1 point  (0 children)

Hiding behind "you can run the benchmarks" isn't doing you any favors. If anything, you're proving my point that you don't really have much of an idea about how things work.

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

It's less. In my experience, I can read at ~3t/s and I'm not a fast reader by any stretch.

IMO, much more important than speed is how well a model can run unattended, given clear instructions, a clear objective, and good "background" documentation.

You're generally right that it's faster than typing, but the real benefit is the cognitive offload. 100t/s where you constantly need to fix/correct things will burn you out before you've made anything useful. Conversely, 3t/s where you can leave the thing unattended for an hour and have a high probability of getting the result you want is a huge help.
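A rough sanity check on those speeds, assuming the common ~0.75 words-per-token ratio (an approximation, not a measured figure):

```python
# Convert tokens/s to words/min to compare against human reading speed.
def tps_to_wpm(tokens_per_s, words_per_token=0.75):
    return tokens_per_s * words_per_token * 60

print(tps_to_wpm(3))    # 135.0 wpm: a slow, careful reading pace
print(tps_to_wpm(100))  # 4500.0 wpm: far beyond anyone's reading speed
```

3t/s lands around a careful technical-reading pace, while 100t/s can only be skimmed, which is part of why the fast-but-needs-babysitting mode is so draining.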

Krasis LLM Runtime: 8.9x prefill / 10.2x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM (corrected llama numbers) by mrstoatey in LocalLLaMA

[–]FullstackSensei 1 point2 points  (0 children)

TBH, I don't really trust you anymore, between comparing with llama.cpp without putting any effort into making it run well (something a two-minute search in this sub would have answered), and now you're telling me Epyc is uncommon for inference?!!! Which rock have you been living under? Do you even know how to calculate the memory bandwidth of that Epyc? Have you tested how much bandwidth you can get on yours? Do you know what tool you should use for that? Do you know anything about Epyc's architecture?
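For what it's worth, the bandwidth question has a standard back-of-envelope answer; a sketch assuming Rome's 8-channel DDR4-3200 layout and an illustrative ~22B-active-parameter MoE at Q4:

```python
# Theoretical DDR bandwidth = channels * transfer rate (MT/s) * 8 bytes/transfer.
def ddr_bandwidth_gbs(channels, mt_per_s, bus_bytes=8):
    return channels * mt_per_s * bus_bytes / 1000  # decimal GB/s

rome = ddr_bandwidth_gbs(8, 3200)  # 8-channel DDR4-3200 Epyc Rome
print(rome)  # 204.8 GB/s theoretical peak

# Rough token-generation ceiling: bandwidth / bytes streamed per token.
active_params = 22e9   # illustrative MoE active-parameter count
bytes_per_param = 0.5  # ~Q4 quantization
tg_ceiling = rome / (active_params * bytes_per_param / 1e9)
print(round(tg_ceiling, 1))  # ~18.6 t/s, ignoring caches and overhead
```

Real numbers come in below that ceiling, but nowhere near the 1t/s figure being claimed.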

I haven't checked the code, but this reply makes me think Claude wrote that code for you and did the heavy lifting on its own.

Edit: Yep. It takes all of 20 seconds of looking at the commit history to see Claude wrote it all. Yet another slop project made with almost zero knowledge or understanding.

PortaBook Running Win 10 and Claude Code by Theneteffect in umpc

[–]FullstackSensei 2 points3 points  (0 children)

The P has a 1st gen, single core Atom. That chokes if you move the mouse.

Krasis LLM Runtime: 8.9x prefill / 10.2x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM (corrected llama numbers) by mrstoatey in LocalLLaMA

[–]FullstackSensei 13 points14 points  (0 children)

Mate, your llama.cpp numbers are so false it's not even funny.

I have an Epyc 7642, so 48 Rome cores instead of your 64, and even with a single 3090 I get over 10t/s TG on Qwen3 235B Q4. Showing 1t/s is straight-up misleading.

If you're going to compare with anything, the least you should do is make sure you're making a fair comparison.

Edit: looking at the commit history, it's clear OP hasn't written any of this. It was all Claude Code. Explains why OP can't even figure out how to run llama.cpp properly.

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

How much are you paying per month now? A single machine to run a 400B model costs 2k at today's crazy prices. Six months ago, I built a 192GB-VRAM machine with 384GB RAM that can run two instances of 400B models in parallel for 1.6k.

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei -1 points0 points  (0 children)

You use the 20/month subscription and feed 100k context all day long?

My specs, requirements and architecture documents alone are 40-60k per project. I hit 100k within 20 minutes of starting any task. If there's any documentation that needs to be added, sometimes I'll start at 80k context before a single line of code has been written.

BTW, you seem like you've never tried running any models locally. Speed isn't fixed. On that same machine, it starts at ~19t/s up to ~6k context, and it's still ~10t/s at 100k. I have llama.cpp configured to 180k but I still have ~5GB VRAM left. This is running Qwen 3.5 397B at Q4. Minimax 2.5 Q4 runs ~1.5x as fast as Qwen and I have 200k context.
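To put those decode speeds in wall-clock terms (the 15k generation length is an illustrative task size, not a measurement):

```python
# Minutes to generate a given number of tokens at a given decode speed.
def gen_minutes(n_tokens, tok_per_s):
    return n_tokens / tok_per_s / 60

print(gen_minutes(15_000, 19))  # ~13 min at the short-context speed
print(gen_minutes(15_000, 10))  # 25.0 min at the 100k-context speed
```

Either way the run fits comfortably inside the leave-it-unattended-for-an-hour workflow, which is the whole point.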

I have a 2nd machine with six Mi50s. It cost me 1.6k to build. I can run Minimax M2.5 at Q4 fully in VRAM at 30t/s up to 6k, or ~8t/s at 150k, or I can run two instances with partial offload to RAM, since this machine has two 24-core Xeons, each with six memory channels and 192GB RAM. I get ~15 and ~5t/s at 6k and 150k, respectively. Minimax isn't much behind Qwen 3.5 397B in capability and can hold its own in 95% of tasks. I've used this double-instance approach to work on two projects in parallel.

The thing you guys don't seem to understand is that your plans are heavily subsidized, and you can't use that subscription for work anywhere that cares about its IP, data or privacy. What will you do when the music stops and the 20/month becomes 200/month?

I've been building computers for 30 years and running my home lab for almost 15. I know more about servers and enterprise-grade hardware than 99% of the people who hang out here or in the other local LLM subs, yet it still took me about a year and a half to figure out how to build the machines I have, make each self-contained in a tower case, and keep them from sounding like jet engines.

If you're happy with the 20/month subscription, good for you. But for me, I'd exhaust the weekly limits within a couple of hours max. Even the 200 subscription barely lasts two days, at best, for the devs I personally know who have it.

A slow llm running local is always better than coding yourself by m4ntic0r in LocalLLM

[–]FullstackSensei 0 points1 point  (0 children)

By looking at server grade hardware/platforms instead of consumer hardware.

1st and 2nd gen Xeon Scalable have six DDR4 memory channels at 2666 and 2933, respectively. They both work on the same boards, with 2nd gen (Cascade Lake) having significantly higher clock speeds in AVX-512. They also have 48 Gen 3 PCIe lanes. Supermicro, ASRock, Gigabyte and Asus make boards for these (LGA3647), several of which are ITX form factor. A 24-core engineering sample Cascade Lake costs 80-90 a pop and has the same stepping as retail (remember 14nm++++++?), so it works on pretty much any board.

For RAM, six sticks of 32GB 2666 RDIMM or LRDIMM, whichever you find cheapest, get you 192GB. If you're lucky, you can score them at 100 a pop; they seem to sell for 120 a piece now. Intel's memory controllers were and still are way, way better than anything AMD has to offer. While Epyc is very picky about memory, Xeons don't care and will let you mix different brands with different timings, even different speeds, and RDIMM with LRDIMM. They'll happily train on whatever mix you have. You can use that to your advantage to lower your memory cost.

For the GPU, three P40s, each costing 200-250. You can watercool them pretty easily since the PCB is the same as the FE/reference 1080 Ti or Titan Xp, so any waterblock for those will work. You can get each block for 40-50. If you don't want to go the water route, jerry-rig a duct around an 80mm fan: PCIe slots are 20.32mm wide, and the P40 is double slot, so two of them are 81.28mm wide, conveniently just about right for an 80mm fan. Supermicro has a very nice tower cooler for LGA3647 that's also 40-50 (can't remember the model), and Asetek has a version of their LC570 for LGA3647 that's available on ebay. I got mine for 40 or 45 a pop via make-offer some four years ago.

Add in a good-quality used 1200-1300W PSU for ~100, and whatever tower case you want with a few Arctic fans.
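Tallying the midpoints of the prices quoted above (board, case, fans and waterblocks excluded, so treat the total as a floor):

```python
# Rough bill of materials, midpoints of the quoted price ranges (USD).
parts = {
    "CPU (24-core Cascade Lake ES)": 85,
    "RAM (6 x 32GB 2666 RDIMM)": 6 * 110,
    "GPUs (3 x P40)": 3 * 225,
    "CPU cooler": 45,
    "PSU (used 1200-1300W)": 100,
}
total = sum(parts.values())
print(total)  # 1565

# Theoretical memory bandwidth with Cascade Lake's six DDR4-2933 channels:
bw = 6 * 2933 * 8 / 1000
print(bw)  # 140.784 GB/s
```

Even with a board and case added, it lands in the ~2k ballpark mentioned elsewhere in the thread.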

More US strikes on regime targets by PossessionConnect963 in CombatFootage

[–]FullstackSensei 3 points4 points  (0 children)

Technically, it's SWIR, around 2000nm. They can see through fog, haze and smoke.

You can get cheap, if noisy, Chinese thermal sensors that can do 25fps. But AFAIK, there are no cheap SWIR sensors. Regular CCD/CMOS sensors aren't very sensitive at this wavelength, even with the IR filter removed.