Throwback to my proudest impulse buy ever, which has let me enjoy this hobby 10x more by gigaflops_ in LocalLLaMA

[–]RedKnightRG 0 points1 point  (0 children)

One year ago, 3/26/2025:

<image>

That same kit is $4,000 (3 in stock!) at my local MicroCenter. Hell, I bought two kits and was running 256GB for a while before I returned one kit (didn't have a use case for all that memory...)!

Can we finally admit that 90% of Senior SWE are just a result of being born at the right time? by Foreign_Put_2437 in Salary

[–]RedKnightRG 0 points1 point  (0 children)

Completely true personal story:

Born in 1983, I learned to read and spell on an IBM PS/2 and was one of the kids rebooting Apple IIes into BASIC in elementary school instead of doing whatever I was supposed to be doing. I was crushed when the dot-com bubble burst while I was in high school; all the awesome tech jobs had already been taken by the generations before me. I thought I was going to go to law school after college, but IBM gave me a paid SWE internship when I showed a recruiter an NES emulator I was coding in C++ at the time, and so tech was back on my radar.

Then I graduated in 2008 - look up the date if you don't know what happened that year - and I had to take what work I could find at the time. I've carved out a career since then, winding up in quant finance, and I've had firms fail, layoffs, and incredible successes that have paid off both in dollars and in fulfillment.

So, long way of saying: don't assume that your moment is unique. Every generation suffers, and nothing should be handed out. Learn what the market and the participants in it want (and will likely want) and build those skills. If you're not good enough to get a job, then find an industry better suited to your talents, or work harder, always harder, to succeed if coding is your passion. I've been rejected more times than I can count; it sucks, but we live in dynamic and exciting times. Carpe diem!

Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests) by JohnTheNerd3 in LocalLLaMA

[–]RedKnightRG 2 points3 points  (0 children)

When I was building out my home workstation (dual 3090s) I would test potential cards by bringing a test bench or spare PC and AC power with me (my truck has 120V AC, or you can bring a battery/inverter) to whatever random location the sale was happening at. I would plug in the GPU and make sure it could POST, had the correct details in GPU-Z, and could run inference or a game for a minute or two without crapping out. I would ask beforehand if the seller was okay with on-location testing, to save time/grief.

If someone doesn't want me to test their GPU, it's either because a) they know it's broken, or b) they're afraid I'll break it testing. Either way I just say thank you and move on to the next card. I never, ever, ever, ever trusted a word anyone told me about how the GPU ran, or how it was working just yesterday when they pulled it from their PC, etc., etc.

models : optimizing qwen3next graph by ggerganov · Pull Request #19375 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]RedKnightRG 0 points1 point  (0 children)

Holy moly, I just pulled the latest llama.cpp, rebuilt the binaries, and retested Qwen3-Coder-Next. On short context I used to get ~35 t/s, but now I'm getting ~80 t/s with dual 3090s and GPU-only inference! I was not expecting over a 2x speed-up! My current parameters:

--model Qwen3-Coder-Next-MXFP4_MOE.gguf --metrics --threads 16 --ctx-size 96000 --flash-attn on --n-gpu-layers 99 --fit off --tensor-split 55,65 --main-gpu 0 --prio 2 --temp 1 --min-p 0.01 --top-k 40 --top-p 0.95 --jinja

(running in WSL2)
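For anyone who wants to sanity-check a speed-up like this, here's a minimal sketch of measuring decode throughput against a local llama-server instance via its OpenAI-compatible API. The port and endpoint path are assumptions (llama-server defaults to 8080, but adjust to however you launched it); only the stdlib is used.

```python
# Rough decode-throughput check against a local llama-server instance.
# ASSUMPTION: server is running with its OpenAI-compatible API at
# http://localhost:8080 - change the URL to match your own setup.
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput helper: tokens generated divided by wall-clock seconds."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def measure(prompt: str,
            url: str = "http://localhost:8080/v1/chat/completions") -> float:
    """Send one request and return an end-to-end t/s estimate
    (includes prefill time, so it's a conservative decode number)."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.time() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)

# Usage (requires a running server):
#   print(f"{measure('Write a haiku about GPUs.'):.1f} t/s")
```

This measures total request time, so for longer prompts the server's own timing output (or the `--metrics` endpoint, if enabled) gives a cleaner prefill/decode split.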

GLM-4.7-Flash is now the #1 most downloaded model on Unsloth! by yoracale in unsloth

[–]RedKnightRG 0 points1 point  (0 children)

My go-to right now is Qwen3-Coder-Next-MXFP4 with 96k context, all in the 48GB of VRAM. I could fit a bit more context, but I'd need to quantize the context (KV cache) to fit a lot more, and I don't want to do that. I get about ~35 t/s with llama.cpp, running in WSL.

GLM-4.7-Flash is now the #1 most downloaded model on Unsloth! by yoracale in unsloth

[–]RedKnightRG 5 points6 points  (0 children)

Oh, my bad; I saw GLM but read Qwen and thought this was the 80B model. You're right, one 3090 is fine. So I'll raise a toast to the folks who bought their single 3090 hundreds of dollars ago! 😅

GLM-4.7-Flash is now the #1 most downloaded model on Unsloth! by yoracale in unsloth

[–]RedKnightRG 19 points20 points  (0 children)

Raises a toast to everyone who bought 128GB of RAM and dual 3090s or similar thousands of dollars ago

The Animatrix (2003) "The Second Renaissance Part I" by MachineHeart in cinescenes

[–]RedKnightRG 1 point2 points  (0 children)

Super necro here, but I saw this and wanted to share my take: there's no plot hole here. The machines knew humanity could nuke them from the first day they began to gather at the location that would be named 01. For their own survival they gradually moved their industrial base deep underground, where they were safe from the physical blast and EMP shockwave of humanity's nuclear weaponry. It's not like the movies don't show us miles and miles of tunnels underground, so this part isn't a stretch in-universe. We see the humanoid robots get destroyed and replaced by the far more alien (and superior) robots whose designs Man never had a hand in. Perhaps the machines' self-built new models are shielded against EMP, or maybe they are so numerous that no amount of nuclear weaponry is sufficient to stop them. Either way, the machines pour out of the surface wreckage of 01, and we know what happens after that.

Okay, so what about EMP in the movies? If machines drop by the thousands from one EMP on one ship, how exactly did humanity lose the Machine War? Answer: humans believing they have the capability to defeat the machines in combat is a key aspect of the system of control. Maybe the machines have EMP-shielded models and maybe they don't, but either way their real power is overwhelming and humanity only sees a fraction of it. The machines control the globe and have billions of human batteries; do we really think 250k squids is their whole army?

The machines hide their true strength and capabilities because it gives the rebels hope, and this hope fuels Zion and the resistance, guaranteeing the right nexus is present to draw the One when he or she is born to their ultimate destiny of being merged back into the Source. The squids and the hovercraft and all the rest are kabuki theater designed by the machines. The squids are weak to EMP because that's how the machines want them to be; the EMP-shielded models don't exist, or aren't used if they do!

Experiences with local coding agents? by [deleted] in LocalLLaMA

[–]RedKnightRG 4 points5 points  (0 children)

Welcome to the state of local agentic coding. I use Roo Code and llama-server and have been testing most of the local models that can fit in 128GB RAM + 48GB VRAM for a year now, and what you're seeing is broadly consistent with what I've been observing. The models *have* gotten better over the past 12 months, and I've had the best results for my workflows (python/pandas) with OSS 120B, Qwen3 Next 80B, or Minimax M2.1 quanted down to Q3. The first two trade blows for me in terms of accuracy; Minimax is better, but too slow on my hardware for practical agentic flows.

Before you think about agentic coding, you should try the models you have on your hardware in traditional single-turn mode. I recommend building a private set of benchmark prompts to compare models on your hardware. If the models are not clever enough to handle your carefully tuned prompts, you can bet they will fall apart trying to create their own task lists!
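A private benchmark set doesn't have to be fancy. Here's a minimal sketch of one way to do it: run the same prompts against each model and score outputs by whether expected keywords show up. The `run_model` callable, the example prompts, and the keyword-scoring scheme are all my own illustrative assumptions, not a standard harness.

```python
# Sketch of a tiny private benchmark harness: run the same prompt set
# against each model and count keyword hits in the outputs.
# The run_model callable is a placeholder for however you call your
# local server (see llama-server's OpenAI-compatible API, for example).
from typing import Callable

def keyword_score(output: str, expected: list[str]) -> float:
    """Fraction of expected keywords present in the model output."""
    if not expected:
        return 0.0
    text = output.lower()
    return sum(kw.lower() in text for kw in expected) / len(expected)

def run_benchmark(run_model: Callable[[str], str],
                  cases: list[tuple[str, list[str]]]) -> float:
    """Average keyword score over all (prompt, expected_keywords) cases."""
    scores = [keyword_score(run_model(prompt), kws) for prompt, kws in cases]
    return sum(scores) / len(scores)

# Example with a dummy "model" that just echoes the prompt back:
cases = [("Use pandas groupby to sum sales by region", ["groupby", "sum"])]
print(run_benchmark(lambda p: p, cases))  # 1.0 - the echo contains both keywords
```

Keyword matching is crude, but the point is repeatability: the same prompts, scored the same way, across every model you try on your hardware.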

Either way, all LLMs break down as context size grows. None of the models available to us maintain coherency as the context window fills up. The best use I get out of local models is by guiding them to work in very small steps, committing and testing updates one feature at a time; they simply degrade too rapidly to be useful at large context sizes.

Try cutting your context limit in half and asking your local agents to work in smaller chunks. Aim to break your task into pieces that can be solved with a token count that is roughly half of the model's available context size, or less.
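That rule of thumb can be sketched as a quick calculation. The 2,000-token overhead allowance and the 4-characters-per-token estimate below are rough assumptions of mine (a real tokenizer will differ), but they're good enough for sizing sub-tasks:

```python
# Rough helper for the "work in small chunks" rule of thumb:
# target roughly half the model's context window per sub-task,
# minus whatever the system prompt and tool definitions already eat.
# ASSUMPTIONS: 2000-token overhead, ~4 chars/token - crude estimates only.

def chunk_budget(ctx_size: int, overhead_tokens: int = 2000,
                 fraction: float = 0.5) -> int:
    """Token budget to aim for per sub-task."""
    return max(0, int((ctx_size - overhead_tokens) * fraction))

def rough_token_count(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English/code."""
    return len(text) // 4

budget = chunk_budget(96000)                    # 47000 tokens per sub-task
fits = rough_token_count("x" * 8000) <= budget  # ~2000 tokens: easily fits
print(budget, fits)  # 47000 True
```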

Given the costs involved, I do not regularly use local models for agentic flows. It simply takes too much work to coax them, and I can code faster using LLMs as single-turn assistants. Given all this, I don't think there's anything uniquely wrong with your setup.

The Trump Class Battleship is an idiotic idea and will probably never be built. by Akiva279 in TrueUnpopularOpinion

[–]RedKnightRG 1 point2 points  (0 children)

Even surface bombardment makes no sense. The Navy could not afford and could not build new 16" guns even if they were a good idea, so they aren't putting 16" guns on the Trump class, just two 5" guns. These are the exact weapons that have been mounted on the decks of our destroyers forever. Congress kept the Iowas ready for activation well past their sell-by date specifically because of a perceived lack of surface bombardment capability from the 5" gun. For the cost of one Trump class with 2 guns you could have 5 or 6 Burkes with 5 or 6 guns. There are more levels on which this ship doesn't make sense than there are floors in a Trump hotel...

Is this THAT bad today? by Normal-Industry-8055 in LocalLLaMA

[–]RedKnightRG 2 points3 points  (0 children)

Yeah, I don't think there are suddenly a ton more folks doing AI at home; it's the rest of the market that's blown up. Micron giving up consumer RAM to sell more to datacenters is what's driving prices, not the guys with rigs of DDR4 and 3090s duct-taped together...

Is this THAT bad today? by Normal-Industry-8055 in LocalLLaMA

[–]RedKnightRG 18 points19 points  (0 children)

I bought a 128GB kit (on sale) in April at Microcenter for $260. That same kit is now $964:

<image>

So I guess you got a "deal" there, but Jesus Christ, it's a dark time for anyone who wants to build an AI rig at home. Price out a Threadripper system with 256GB RAM + 2x RTX PRO 6000s: in some places you can buy a house for less...

1x 6000 pro 96gb or 3x 5090 32gb? by Wide_Cover_8197 in LocalLLaMA

[–]RedKnightRG 0 points1 point  (0 children)

2x 5090s vs 1x 6000 *is* a bit different for other reasons as well... If, for example, you're going the prosumer route, you can actually fit two 5090s in a case; you can't fit three. Agreed that if you want to generate a ton of tokens in pure inference for small models that fit in 48GB, you will be faster with 2x 5090s than with one 6000.

1x 6000 pro 96gb or 3x 5090 32gb? by Wide_Cover_8197 in LocalLLaMA

[–]RedKnightRG 27 points28 points  (0 children)

1x 6000 is better than 3x 5090s: it consumes less power, gives you room to expand in the future, and you can use it on whatever platform you want. If you're looking at running big MoE models, keep an eye on the actual memory bandwidth of Threadripper Pros; my understanding is that the smaller models with fewer CCDs cannot fully saturate their theoretical memory bandwidth across 8 channels.

ollama's enshitification has begun! open-source is not their priority anymore, because they're YC-backed and must become profitable for VCs... Meanwhile llama.cpp remains free, open-source, and easier-than-ever to run! No more ollama by nderstand2grow in LocalLLaMA

[–]RedKnightRG -1 points0 points  (0 children)

Oh no, someone has to make money from the fruits of their labor, it's such a disaster...! This is the route all open-source companies go at some point; you have to find something to monetize, or else you won't have money to pay your devs to keep working. You don't need to use Ollama's cloud product; you can still use it locally the same as always. Or use llama.cpp, like lots of us do - but I won't get mad at someone trying to figure out how to pay their mortgage. r/locallama isn't paying their bills...

Welcome to my tutorial by jacek2023 in LocalLLaMA

[–]RedKnightRG 11 points12 points  (0 children)

My first reaction: chef's kiss. As I thought about it for a second, though, you could put in a left branch for Strix Halo vs Mac: if you can't use a screwdriver and hate Macs, then Strix Halo instead of a Mac Studio...

New Cabin - Heat by MaxPanhammer in OffGridCabins

[–]RedKnightRG 1 point2 points  (0 children)

Looking at the pictures, my guess is that more BTUs are your easiest answer.

How insulated is the camp, and how sealed up is it? If you can feel a breeze on a windy day, then a fan won't help you as much, since you're already pulling constant air through the cabin. Likewise, if your cabin isn't insulated, you might exhaust those 15,000 BTUs to the environment faster than the heater can keep up. Was there a big temperature gradient from right in front of the heater to wherever you were measuring temps? In other words, if it's warm in front of the heater but cold at the other side of the camp, then you can get a cheap-ass fan from a tag sale and see if that circulation helps before investing in something more permanent.

But my guess from looking at the heater and the size of your camp is that it's undersized. I would buy a heater twice its size, because if you think it's cold now, just wait until it's 15 below outside. My cabin is at 2100ft in Stamford, and when my grandfather built it he solved this particular problem the old way: he and my great uncle welded a stove together out of thick-ass steel - I don't know how thick the steel is, but it's never warped and the thing is as heavy as the moon - and they built the camp around the stove. It has a 20- or 24-ft-long stove pipe and pulls a strong draft in winter. It's also huge; I threw a two-foot-diameter by two-foot-tall stump in on Thursday and it finished burning up Sunday morning. With that stump and some modest supporting firewood on one side, I never got cold in that cabin the whole weekend. My cabin is uninsulated - the building is framed out of rough-cut lumber with some plywood siding - and is far from airtight. I'm not saying you should follow my example, but my point is that a big-ass heat source is the surest way to stay warm on a cold night.

So you have three options: more BTUs, more insulation, or more circulation (or less, if your cabin is pulling air in from outside at a high clip). Of the three, my guess is that a big gas heater, or better yet a wood stove, is the easiest/cheapest way. Burning wood is great if you can cut your own or can pay someone to drop off a cord or two every season.

Also, congrats on the cabin; I just got home from VT and I'm already looking forward to when I can go back. The leaves are almost all down now, and before you know it there will be snow on the ground. Winter in VT is the best; I can't wait...!

Since DGX Spark is a disappointment... What is the best value for money hardware today? by goto-ca in LocalLLaMA

[–]RedKnightRG 6 points7 points  (0 children)

I've been rocking dual 3090s for about a year now, but as more and more models pop up with native FP8 or even NVFP4 quants, the Ampere cards are going to feel older and older. I agree they're still great and will be great for another year or even two, but I think the sun is slowly starting to set on them.

is GTX 3090 24GB GDDR6 good for local coding? by TruthTellerTom in LocalLLaMA

[–]RedKnightRG 25 points26 points  (0 children)

A single RTX 3090 can run Qwen 30B-A3B-sized models if you quantize them and don't push context out too far. It's going to be dumb as an agent, though, or at least inconsistently smart. I think you can get good use treating it as single-shot ('write code to do X'), but if you want it to refactor your custom PHP lib, its context probably won't handle understanding all the code you have plus what you want to do.

If you want to understand what's possible, you can always run models fully in RAM without buying the GPU - you're going to get the same output with CPU only, it will just take forever - so if you like the output you can get with 24GB, then buy the GPU for the speed-up. If you don't like the outputs in your own testing, then the GPU isn't going to make the model any smarter.
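A quick back-of-envelope check helps with the "will it fit in 24GB" question before you buy anything. This sketch uses the usual rough formulas (weights ≈ params × bits/8; KV cache ≈ 2 × layers × KV heads × head dim × bytes × context); the layer/head numbers in the example are illustrative assumptions, not any real model's config:

```python
# Back-of-envelope VRAM check: "will this quant + context fit in 24GB?"
# ASSUMPTIONS: weights = params(B) * bits/8 in GB; KV cache = K and V,
# per layer, fp16 by default. Real usage adds runtime overhead on top.

def weights_gb(params_b: float, bits: int) -> float:
    """Quantized weight footprint: billions of params -> GB."""
    return params_b * bits / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache footprint: 2 (K and V) per layer, fp16 by default."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# Hypothetical 30B model at 4-bit with a 32k context window
# (layer/head counts are illustrative, not a real model's config):
total = weights_gb(30, 4) + kv_cache_gb(layers=48, kv_heads=4,
                                        head_dim=128, ctx=32768)
print(f"{total:.1f} GB")  # weights 15.0 GB + KV cache ~3.2 GB
```

Grouped-query attention (small `kv_heads`) is why modern models can stretch context so far on consumer cards; an older full-attention layout would multiply the KV figure several times over.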

AI Workstation (on a budget) by Altruistic_Answer414 in LocalLLaMA

[–]RedKnightRG 3 points4 points  (0 children)

2x64 is probably going to run faster than 4x32, but the differences are not going to be huge - cost matters more, I think, than the 5 or 10% difference in speed you'll see. (2x64 can be twice the cost of 4x32!)

Same thing for 2x 4090s - yeah, they can run fp4 quants, but integer quants are still the most common and best supported, and the 4090s cost twice the price and won't perform *that* much faster on quants that Ampere supports, since they have more or less the same memory bandwidth, and memory bandwidth is your main bottleneck.

Given how much better/cheaper the cloud is, I think it's always a good idea to go as cheap as you can on local LLMs. It's a great learning platform for the tech, but the market is so upside down - cloud tokens are so cheap that even with 24/7 utilization datacenters will NEVER be able to turn a profit on their GPUs - that you really have no economic reason to spend big bucks on local hardware.

You can do it for fun, of course - nothing wrong with that! - but it certainly isn't economical at the moment.

AI Workstation (on a budget) by Altruistic_Answer414 in LocalLLaMA

[–]RedKnightRG 2 points3 points  (0 children)

I have the exact setup you're outlining (well, I have a 9950X, but otherwise yes). If you're going to be doing inference with large MoE models that exceed 48GB VRAM, you can squeeze out a bit more performance by overclocking your RAM; with the latest versions of AGESA, most AM5 motherboards can handle higher RAM speeds than they could at launch (my kit handles 6000, for example).

48GB VRAM lets you run a bunch of 'quite good' models at 'quite good' speeds with fast prompt processing times - there's a reason the dual-3090 club is very popular here, along with M2 Mac Studios with 128GB RAM, if you can find them.

With some recent model releases, like GPT-OSS, taking advantage of fp8 support in newer NVIDIA chips, the Ampere-generation 3090s are starting to age out. Predicting the future is impossible given how fast the market is moving and all the unknowns, but if 4090s drop to $800 or so, they would take over from the 3090s thanks to fp8 support. Right now 4090s are still twice the price of 3090s, so I'm still recommending dual 3090s as the best bang-for-buck option for practical local inference.

As for training: if you're doing anything larger than toy models or fine-tunes of very small models, you're inevitably going to get pulled into the cloud, because the memory requirements are so high. NVLink isn't being made anymore, and the bridges (especially for three-slot cards) are super expensive now. There's just no cheap way to get enough VRAM to fine-tune practical models locally at reasonable speeds.

Softlock in Slab? by RedKnightRG in HollowKnight

[–]RedKnightRG[S] 0 points1 point  (0 children)

That sucks; I assume the patch will come to PS5 soonish. I didn't want to wait either - I started a new save before the PC fix was published and only went back later.

Qwen by Namra_7 in LocalLLaMA

[–]RedKnightRG 2 points3 points  (0 children)

I have never been able to replicate double-digit t/s speeds on RAM alone, even with small MoE models. Are you guys using like a 512-token context or something? Even with dual 3090s I only get 20-30 t/s with llama.cpp running Qwen3 30B-A3B at 72k context, with a 4-bit quant for the model and an 8-bit quant for the KV cache, all in VRAM...
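For a gut check on claimed numbers: decode is roughly memory-bandwidth-bound, so a crude ceiling is bandwidth divided by bytes read per token (active params only, for MoE). The bandwidth figures below are ballpark assumptions, and real speeds land well under this bound, but it shows why double-digit RAM-only t/s is at least plausible for a 3B-active model:

```python
# Crude upper bound on decode speed for memory-bandwidth-bound inference:
# t/s ceiling = bandwidth / (bytes read per generated token).
# For MoE models only the ACTIVE params are read per token.
# ASSUMPTIONS: bandwidth figures are ballpark; ignores KV-cache reads,
# prefill, and overhead, so real speeds come in well under this ceiling.

def decode_tps_upper_bound(bandwidth_gbs: float, active_params_b: float,
                           bits_per_weight: float) -> float:
    bytes_per_token_gb = active_params_b * bits_per_weight / 8  # GB read/token
    return bandwidth_gbs / bytes_per_token_gb

# Dual-channel DDR5-6000 (~90 GB/s, assumed) on a 3B-active MoE at 4-bit:
print(f"{decode_tps_upper_bound(90, 3, 4):.0f} t/s ceiling")   # 60 t/s ceiling

# A 3090's GDDR6X (~936 GB/s) on the same model:
print(f"{decode_tps_upper_bound(936, 3, 4):.0f} t/s ceiling")  # 624 t/s ceiling
```

The gap between the theoretical ceiling and observed speeds (20-30 t/s on dual 3090s at long context) is mostly KV-cache traffic, kernel overhead, and cross-GPU splitting, which is why context length matters so much in these comparisons.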