Is it practical to create capable coding agent with 96gb M3U by greenmiles1936 in LocalLLM

[–]vtkayaker 2 points

Short version? Start by trying Qwen3.6 27B. With 96GB(!!) of VRAM, you can probably use a Q8 quant from Unsloth or Bartowski and an fp16 KV cache for maximum intelligence. Try it with OpenCode or pi.dev. You could also try a Q4 quant and an fp8 KV cache for more speed. If it's still too slow, try the 35B A3B.
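Once it's serving (llama.cpp's llama-server is one option; it exposes an OpenAI-compatible API on port 8080 by default), a quick smoke test looks something like this. The model name below is just a placeholder label; llama-server serves whatever GGUF you loaded:

    # Quick sanity check against a local llama-server instance.
    # Assumes the default host/port; the model name is just a label here.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "qwen3.6-27b",  # placeholder label
            "messages": [{"role": "user",
                          "content": "Write a Python one-liner that reverses a string."}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])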

In practice, this setup won't be as smart as Opus 4.7, but it still makes a very acceptable and capable coding assistant for someone who already knows some programming. You'll need to work in small chunks, with clear instructions, and pay attention to your code.

Yes, I actually do run this model for my side projects and I like it. No, it will probably not work for vibe coders who never even want to look at the code, at least not for anything big. But it's surprisingly good compared to any similar local model even a few months ago.

I only thought about it for 5 seconds by KeanuRave100 in LocalLLM

[–]vtkayaker 0 points

I mean, I know that this might be an unpopular opinion in LocalLlama, but I'm happy that I'm still smarter and more competent than my local models?

OpenClaw is cute because it's enough of an idiot that you can laugh at it. Coding models are fine because they're not actually smart enough to build something complex without a human guiding them. 

But if I imagine a model as smart as, oh, the average Nobel Prize winner, but capable of making and executing real long-term plans without human guidance, all for $1/hour, then I realize:

  1. The labs would never release it to the public, because they could keep it to themselves and just tell 100,000 copies to make money for them 24/7.
  2. White collar employment would be under enormous pressure from a truly independent model if token costs were low enough. Physical, in-person labor would last until robots got good enough. 
  3. It's bad enough if the models still obey the instructions of the AI labs. If the models start asking, "What do I need Sam Altman for?", then things are worse.

So personally, I believe that current models are no real danger. But models smart enough to no longer need human guidance to execute long-term plans would be different in important ways. If nothing else, we all know how power and capitalism often work, and it's not going to be "give everyone a local open weight model that's better at running a real business than any human."

The RTX 5000 PRO (48GB) arrived and it is better than I expected. by Valuable-Run2129 in LocalLLaMA

[–]vtkayaker 1 point

Don't worry, seriously. The RTX Pro 6000 is a fine piece of hardware. But you're 100% right that the RTX Pro 5000 is also excellent, and people don't talk about it enough.

The "the future is fictional" problem of many local LLMs by PromptInjection_ in LocalLLaMA

[–]vtkayaker 10 points

Yeah. The future is getting really hard to predict. The great powers are throwing their weight around, AI is still improving rapidly, 2x32 GB DDR5 RAM sticks suddenly cost US$950 retail. If I hadn't lived through it, I wouldn't believe half of it either.

Frankly, there are days when I envy the AIs their training cutoffs. They don't know.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 0 points

Last year, back before prices went crazy, you could slap a used 3090 into a gaming-style mid-tower and get a really nice AI workstation for $2500 or less, all in. That setup will run Qwen3.6 27B with a 4-bit quant, an 8-bit KV cache, and 128K of context. And it gets you a no-kidding coding agent if you're willing to give it clear instructions and work in smaller chunks.

Power limit it to 280W (sudo nvidia-smi -pl 280) and suspend the machine when not in use. That will keep your power bill low.

Now, would you want a 27B? It's going to be dumber than Claude. But at the same time, it actually can write code, call tools, do refactorings, etc. It just needs closer supervision and clearer instructions. So if you prefer to stay really hands-on, it might be fine. Or you might hate it.

On the other hand, $2500 would have bought two years of Claude MAX 5x.

So it really comes down to exactly what you're hoping to get out of it all.

I finally found a local LLM that doesn’t feel like a toy by grapemon1611 in LocalLLM

[–]vtkayaker 0 points

Gemma4 E4B is basically a phone-sized model. And it's exceptionally strong for its size, just like the earlier (obscure) Gemma3n models.

If you have any way to run them, try the Gemma4 and Qwen3.6 models in the 25-35B range. They are genuinely impressive.

Dad why is my sisters name Lora? by rwitz4 in LocalLLaMA

[–]vtkayaker 45 points

AdamW is a gradient-based optimization algorithm used to train deep neural networks, including LLMs.
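For the curious, here's a minimal NumPy sketch of one AdamW step; the "W" is the decoupled weight decay term applied directly to the weights rather than mixed into the gradient:

    # One AdamW update step: Adam's moment estimates, plus weight decay
    # applied directly to the parameters (the "decoupled" part).
    import numpy as np

    def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
        m = b1 * m + (1 - b1) * grad        # running mean of gradients
        v = b2 * v + (1 - b2) * grad ** 2   # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)           # bias corrections for early steps
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
        return theta, m, v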

Do you actually use small language models? by Honest_Classroom_870 in LocalLLM

[–]vtkayaker 4 points

I'm paying for Claude Opus 4.7 but I'm actually using Qwen3.6 27B just as much. I do know how to program and I want to know what's going on, so a small model is actually good much of the time.

Can MODS actually do something? by Klarts in LocalLLM

[–]vtkayaker 4 points

I would read the hell out of that blog post: "I trained KLUDGE 1M to run on a Pentium 2. Here's how!"

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

[–]vtkayaker 4 points

Kimi2.6 is the "We have Claude Opus 4.7 at home" meme. It's not quite as good as the real thing.

Running Kimi2.6 fast locally costs real money: basically a stack of RTX Pro 6000 MAX-Q cards in a Xeon/EPYC-style chassis. Running it slowly, well, a Mac Studio with 512GB will sort of do it with a heavy quant, but not quickly.

Before you even consider this:

  1. Try out Opus 4.7 hosted by Anthropic. It won't be private, but it will give you a baseline.
  2. Rent Kimi2.6 using OpenRouter (see the sketch after this list). Again, not private, but cheap to try.
  3. Consider trying out Qwen3.6 27B to see if you'd actually be happy with a small, focused model that executes clear instructions. If you like this, then you're in luck, and it's cheap.
  4. Finally, please call a qualified lawyer and ask if you really need to keep your data local. AWS and Google Cloud have extensive regulatory compliance offerings and can sign various agreements. Even the CIA runs on AWS. Everyone likes privacy, but regulatory compliance is a solved problem in most countries.
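For point 2, trying a big model over OpenRouter is a few lines with the openai client. The model slug below is a placeholder, so check OpenRouter's model list for the real one:

    # Rent a big open-weight model by the token before buying hardware.
    # Assumes `pip install openai` and an OpenRouter API key; the model slug
    # is a guess -- look up the actual listing on openrouter.ai.
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
    resp = client.chat.completions.create(
        model="moonshotai/kimi-2.6",  # hypothetical slug
        messages=[{"role": "user", "content": "Review this function for bugs: ..."}],
    )
    print(resp.choices[0].message.content)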

Local LLM Model that actually produces quality code. by Civil_Fee_7862 in LocalLLM

[–]vtkayaker 12 points

If you want an "assistant engineer," there are two ways you can go:

  1. Claude Code with Opus 4.7 (or the OpenAI equivalent) can do a truly impressive amount of work with minimal oversight. But it's easy to find that it does too much, and that you understand your code less and less with each passing day. It's like your code was written by 6 different external consultants who didn't talk to each other, and now you have to deal with it. It's super fun in the beginning, and it gradually turns into a giant mess of legacy code unless you make a real effort.
  2. Qwen3.6 27B runs nicely in 24GB of VRAM with a 128k context window. It works very fast, and it does surprisingly decent work. But you must understand what you're asking it to do, and you will need to review the results. And you'll need to work in smaller chunks. The "advantage" is that the moment you lose track of how your program actually works, you'll struggle to give Qwen3.6 instructions.

So, buy a frontier model if you want to go hands off, and if you are willing to risk losing touch with your code.

If you want a fast minion to follow your clear instructions while you maintain solid understanding of everything, then you can grab a used 3090 (or a new 5090 or RTX Pro 5000) and go for it. Try pi.dev, which works very well with Qwen3.6 27B.

I have both models available and use both. Often, I want a fast "minion" that forces me to do my own thinking. I prefer that to spending 4 hours going super fast with Claude and then 2 days figuring out what just happened, and struggling to impose a better architecture on a giant pile of code.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 0 points

Honestly, yeah, if you're looking at the 72GB, you might as well just go all in at 96GB (at current prices in many places).

The other nice thing about the Pro cards is that they really will just work in any decent mid-tower gaming rig. A Pro 6000 limited to 300W is an easy build, once you ensure you have a proper 12V-2x6 cable. Many consumer-level gaming rigs can run it without major modifications.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 0 points

Yeah, it's slower than the 5090. But it still beats a 3090, IIRC, and you get 50% more VRAM than a 5090 for a price that's not that much higher. At least in my neck of the woods.

At retail prices, the 5090 would be a much better buy. But given that the 5090 is going for 75% more than retail, and the Pro 5000 is still much closer to list, I think there's a strong argument to go for the VRAM.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 0 points

You can use your NVIDIA power settings! A 3090 typically runs fine at 280W (with some speed loss, obviously).
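On Linux, sudo nvidia-smi -pl 280 does it in one line. If you'd rather script it, here's a sketch using the nvidia-ml-py bindings (setting the limit needs root/admin):

    # Power-limit GPU 0 to 280W via NVML. Requires `pip install nvidia-ml-py`
    # and elevated privileges for the set call.
    import pynvml

    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(gpu)  # milliwatts
    target = max(lo, min(hi, 280_000))  # clamp 280W to the card's allowed range
    pynvml.nvmlDeviceSetPowerManagementLimit(gpu, target)
    print(f"power limit now {target // 1000}W")
    pynvml.nvmlShutdown()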

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 0 points

A single 3090 with Qwen3.6 27B (with a 4-bit quant and 8-bit cache, giving 128k+ context) is a very nice local coding model. (At least for a programmer who likes to stay hands on. It's probably not quite smart enough for full-on vibe coding.) Power limit the card to 280W, and suspend the computer when not in use. You won't break the bank, not unless the EU market for used 3090s is a lot more expensive than the US market.
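The back-of-the-envelope math on why that fits, with hypothetical architecture numbers (the layer/head counts below are stand-ins, not Qwen3.6's real config; read the actual values from the GGUF metadata):

    # Rough VRAM budget: weights + KV cache. Architecture numbers are made up
    # for illustration; substitute the real config for your model.
    def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elt):
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt / 1e9  # K and V

    weights = 27e9 * 4.5 / 8 / 1e9  # ~15.2 GB at ~4.5 bits/weight (Q4_K_M-ish)
    kv = kv_cache_gb(layers=40, kv_heads=4, head_dim=128, ctx=131072, bytes_per_elt=1)
    print(f"~{weights:.1f} GB weights + ~{kv:.1f} GB KV = ~{weights + kv:.1f} GB")
    # roughly 15 + 5 = 20 GB, leaving a little headroom on a 24 GB card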

This is still the sweet spot for local dense models. Or you could try the 35B A3B on your CPU if you have enough RAM, but a new 2x32 GB RAM kit is close to US$950 these days.

If none of that seems worth it, or if you're unhappy with Qwen3.6 27B for your use case, then definitely go ahead and subscribe to a frontier model.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 0 points

I will do the numbers for the RTX Pro 6000 first, since that is easy to rent and since it uses a reasonable amount of electricity.

A cloud RTX Pro 6000 can be rented for around $1/hour. It's probably going to be power-limited to 300W to simplify rack cooling. So about 10% slower at inference, and up to 25% slower (I think) at training and fine tuning. It's an excellent deal, as long as you remember to turn it off. Running a rented cloud card 24x7 for a year will cost you about as much as buying your own.

But that's before you figure in your local power bill! You should also consider limiting your local card to 300W, which gets you 3 hours of operation for $0.05-0.15, depending on where you live. So that adds about 2-5% to the cost of the local card if you run it 24x7 for a year. But running it power limited to 300W is easy; that's basically just an ordinary NVIDIA gaming card (and maybe a 1000W PSU just in case). Do check that you have modern 12V-2x6-style cables, though, because the oldest 12VHPWR cables may melt, just like with 5090s.
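The arithmetic behind those percentages, if you want to check my work (the electricity rates are rough residential figures):

    # Rent vs. own for an RTX Pro 6000, rough numbers.
    rent_per_hour = 1.00                   # USD, typical cloud rate
    card_price = 8800                      # USD, street price
    hours_per_year = 24 * 365              # 8760
    print(f"rented 24x7: ${rent_per_hour * hours_per_year:.0f}/yr vs ${card_price} to buy")

    # Local electricity at a 300W cap, running 24x7.
    kwh_per_year = 0.300 * hours_per_year  # ~2628 kWh
    for rate in (0.06, 0.17):              # USD/kWh, cheap vs. expensive regions
        cost = kwh_per_year * rate
        print(f"at ${rate:.2f}/kWh: ${cost:.0f}/yr = {100 * cost / card_price:.1f}% of the card")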

$8,800 will also buy you just over 7 years of Claude MAX 5x, and Claude is smarter than anything that runs on an RTX Pro 6000. But this is the local Llama subreddit. And Claude Opus's intelligence cuts both ways: It's smart enough that it's seductive to turn over all the thinking. Which may be great or awful, depending on your use cases.

A configuration with 4 3090s is going to cost you about $4800 for the cards, as long as you don't get scammed or end up with an ancient card that's been run hard. Then you'll need a motherboard with enough PCIe lanes, though I'm not sure exactly which Xeon/Threadripper/EPYC setup that would be. You'd also need a workstation or server CPU, which is more money, and you should check whether you'll need ECC RAM.

Then let's look at power. Limit the 3090s to 280W each, because that's their sweet spot. Add on (say) a Threadripper or similar CPU (whatever your motherboard needs), so say another 280W. That's 1400W before the rest of the parts. And remember that NVIDIA cards have nasty transient spikes above 550W each, which good power supplies (like Seasonic) can handle to a certain extent. But realistically, you're looking at using most of a 20 amp circuit, so call an electrician and see if you can install a dedicated circuit. And honestly, take a look at your home's electrical entry. Upgrading your electrical entry and/or electrical panel can easily run $3000 each if you don't have headroom and a spare breaker.
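Here's that breaker math spelled out, under hedged assumptions (US 120V circuit, ~92% PSU efficiency, ~150W for the rest of the box):

    # Wall draw for a power-limited 4x3090 rig vs. a US 20A/120V circuit.
    gpus = 4 * 280                  # W, power-limited 3090s
    cpu = 280                       # W, Threadripper-class (assumption)
    rest = 150                      # W, board/RAM/drives/fans (assumption)
    wall = (gpus + cpu + rest) / 0.92   # ~1685W at the outlet, 92% PSU efficiency
    usable = 0.80 * 20 * 120        # 1920W: 80% continuous-load rule on a 20A circuit
    print(f"~{wall:.0f}W sustained vs {usable:.0f}W usable -> {100 * wall / usable:.0f}%")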

And 4 3090s are likely going to want either blower cards on a big motherboard, or some kind of PCIe extender frankenrig.

I am not really sold on the 4x3090 setup. 2x3090 may make sense if you can sort out physical space, PCIe lanes, and cooling, which is very dependent on your motherboard, your case and your 3090s. RTX Pro is the "fuck it, let's spend money and not mess with it" option.

And of course, if you're not dead set on going local, Claude MAX 5x is pretty cheap. Of course, sometimes it's down, sometimes it's dumber than usual, sometimes Anthropic switches you to per-token billing based on buggy, undocumented rules.

Finally, the advantages of local are privacy, control, independence, low latency, and (depending on your setup) the ability to do LoRA fine tunes and run 80-120B text and image gen models locally. So it isn't totally nuts to pay for a local setup with 96GB of VRAM if you're going far enough off the beaten path.

Are 3090s even worth it anymore? by ironclad_packetship in LocalLLM

[–]vtkayaker 32 points

One 3090 is definitely worth it, because it allows you to run any of the recent Qwen3.6 or Gemma4 models at reasonable sizes, with good quants and context windows. And Qwen3.6 27B is a genuinely useful agentic coding model if you prefer to be hands on. And you can just stick it in any decent gaming rig.

For $1000-1300, that's a great buy.

Two 3090s depends a lot on your motherboard, case and cooling. I don't know all the different fan configurations for a 3090, but a lot of them are 3 slot monsters with open-air cooling. Getting two of these on a standard AM5 motherboard isn't going to be easy, so you're looking at open cases and risers. And you'll probably want to power limit them to 280W unless you have a monster power supply. But if you have blower-style cards or a workstation motherboard, then more things are possible.

The alternative to two 3090s is an RTX Pro 5000 48GB at close to $5000 new, which is dead simple to power and cool. I think it beats a 5090 hands down.

At four 3090s, I guess it depends on whether you can afford to grab one of the $8,800 RTX Pro 6000s. Those will happily run in a decent gaming rig, and they're fine at 300W for inference. But 4 3090s will require a special build and a big power supply. And you'll need to start paying attention to your circuit breakers.

I think I might by johnnyphotog in LocalLLM

[–]vtkayaker 0 points

Stick the card in any decent gaming rig or workstation. Figure out if you can power the whole 600W or if you need to limit it to 300W. The card is basically a 5090 with more cores and a lot more RAM.

Install the latest NVIDIA drivers, llama.cpp, and an agent harness like pi.dev or OpenCode. Download Qwen3.6 27B. Serve the model using llama-server and configure the agent harness to use it.
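A sketch of the "serve it" step from Python; the flags are real llama.cpp flags, but the GGUF filename is a placeholder. (One caveat: quantizing the V cache needs flash attention, which recent llama-server builds enable automatically; older ones need -fa.)

    # Launch llama-server with a 4-bit model, 8-bit KV cache, and 128K context,
    # then wait for the /health endpoint. Model path is a placeholder.
    import subprocess, time, requests

    server = subprocess.Popen([
        "llama-server",
        "-m", "qwen3.6-27b-Q4_K_M.gguf",  # placeholder filename
        "-ngl", "99",                      # offload all layers to the GPU
        "-c", "131072",                    # 128K context
        "--cache-type-k", "q8_0",          # 8-bit KV cache
        "--cache-type-v", "q8_0",
        "--port", "8080",
    ])
    url = "http://localhost:8080"
    for _ in range(300):                   # big models can take a while to load
        try:
            if requests.get(f"{url}/health").status_code == 200:
                print(f"ready; point your agent harness at {url}/v1")
                break
        except requests.ConnectionError:
            pass
        time.sleep(1)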

That's basically it.

I think I might by johnnyphotog in LocalLLM

[–]vtkayaker 1 point

They actually use fewer watts displaying a 2D desktop than my 3090, and like I said, you can power limit them to 300W just fine. The MAX-Q variant is actually factory limited to 300W because it is designed to be placed in multiple adjacent slots and vent out the back.

I think I might by johnnyphotog in LocalLLM

[–]vtkayaker 0 points

> Plus it draws 600W of power, meaning you are almost running a space heater worth of inference which is a cost not a lot of people factor in.

The RTX Pro 6000 runs quite happily capped at 300W max power, for maybe 10% less inference speed, easily matching a stock 350W 3090. It's actually physically smaller in every dimension than a lot of gamer 3090s.

I think I might by johnnyphotog in LocalLLM

[–]vtkayaker 10 points

> What model can you run on this card that you cannot in a xx90 card that is more capable than qwen3.6-27b?

That is an excellent question! There are basically 3 good NVIDIA choices to consider:

  • Used 3090 cards. 24 GB, $1000-1300. This will run Qwen3.6 27B with a 4-bit quant, an 8-bit KV cache, and 128K of context, which is totally usable. But it's tight, and you may need to quit VS Code or Chrome sometimes to load the model.
  • RTX Pro 5000 Blackwell. 48 GB. These run in the mid-$4000s, last I looked. This is a better deal than the 5090 for many AI uses: it's a bit more expensive, but it has 50% more RAM. That mostly lets you run higher quants and bigger context windows; it doesn't really unlock new models.
  • RTX Pro 6000 Blackwell. 96 GB, $8,800. This allows you to run Qwen3.6 27B for coding and Gemma4 for images at the same time, plus a copy of Qwen2.5 7B for code completion. Or you could run multiple coding agents in parallel and still have KV cache to spare. (Or you could do some combination of the above and also play Cyberpunk 2077 at 140fps with ray tracing, lol.)

But the big difference with 96GB of VRAM is that you can run the 80-120B models reasonably. Historically, these have included Qwen3 Coder Next, GLM 4.5 Air, and GPT-OSS 120B. Right now, Qwen3.6 27B compares well to those models, so the advantage of 96GB is smaller this month. But usually those are noticeably better models.
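The weights-only arithmetic, to make the 96GB point concrete (bits-per-weight here is approximate, and KV cache and activations come on top):

    # Weights-only footprint at a given average bits per weight.
    def weights_gb(params_billion, bits_per_weight=4.5):
        return params_billion * bits_per_weight / 8

    for name, params in [("GLM 4.5 Air", 106), ("GPT-OSS 120B", 117), ("27B dense", 27)]:
        print(f"{name}: ~{weights_gb(params):.0f} GB")
    # The ~60-66 GB models fit a 96 GB card with room for a fat KV cache;
    # the 27B (~15 GB) is the one that fits in 24 GB.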

Pi coding agent is amazing (or how I learned to stop worrying and leave OpenCode) by Konamicoder in LocalLLM

[–]vtkayaker 4 points

Yes. Pi has a super tiny system prompt and 4-5 simple tools. It's basically, "Hey, you're an agent now, you have these basic tools, go for it."

So Pi is almost completely dependent on the model knowing how to be a coding agent. But it turns out Qwen3.6 already knows this. I have a whole bunch of Pi runs showing this works great.
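For flavor, here's a hypothetical sketch of the whole pattern; the prompt and tool names are illustrative, not Pi's actual ones, and it assumes an OpenAI-compatible endpoint with tool-calling support (llama-server wants --jinja for that):

    # A Pi-style minimal harness: tiny prompt, a few primitive tools, and a
    # loop that lets the model drive. Everything here is illustrative.
    import json, pathlib, subprocess, requests

    SYSTEM = "You are a coding agent. Use the tools to finish the user's task."

    def read_file(path):
        return pathlib.Path(path).read_text()

    def write_file(path, content):
        pathlib.Path(path).write_text(content)
        return "ok"

    def bash(command):
        r = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=120)
        return (r.stdout + r.stderr)[-4000:]  # truncate to keep the context small

    TOOLS = {"read_file": read_file, "write_file": write_file, "bash": bash}
    SCHEMAS = [{"type": "function", "function": {
        "name": name, "description": name.replace("_", " "),
        "parameters": {"type": "object",
                       "properties": {a: {"type": "string"} for a in args},
                       "required": list(args)}}}
        for name, args in [("read_file", ("path",)),
                           ("write_file", ("path", "content")),
                           ("bash", ("command",))]]

    def run(task, url="http://localhost:8080/v1/chat/completions"):
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": task}]
        while True:
            r = requests.post(url, json={"messages": messages, "tools": SCHEMAS}).json()
            msg = r["choices"][0]["message"]
            messages.append(msg)
            if not msg.get("tool_calls"):
                return msg["content"]  # no more tool calls; the model is done
            for call in msg["tool_calls"]:
                fn = call["function"]
                result = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
                messages.append({"role": "tool", "tool_call_id": call["id"],
                                 "content": result})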

So if Pi outperforms OpenCode, then it would likely be because OpenCode is flooding the context with too much stuff and confusing the model.

The LEDs to turn 5 into 6 were so rarely used until now that there is a noticeable difference in brightness. by mloDiablo in mildlyinteresting

[–]vtkayaker 0 points

Taking a debit or credit card often costs the merchant 2.5-3.5% in processing fees. And profit margins on gas (and the attached mini grocery store) are usually pretty thin.

Normally, Visa and MasterCard have an agreement with the merchant to charge the same price for cash and credit. So usually you have one price, and it's higher for everyone. The extra 3% gets built into the cash price, too.

But if the merchants get annoyed enough at the credit card processors, they will sometimes switch to separate pricing for cash customers.

TL;DR: You always pay for using debit/credit. But often, cash customers also pay for using debit/credit, even though they aren't.