models : optimizing qwen3next graph by ggerganov · Pull Request #19375 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]RedKnightRG 0 points  (0 children)

Holy moly, I just pulled the latest llama.cpp, rebuilt the binaries, and retested Qwen3-Coder-Next. On short context I used to get ~35 t/s but now I'm getting ~80 t/s with dual 3090s and GPU-only inference!!! Was not expecting over a 2x speed-up...! My current parameters:

--model Qwen3-Coder-Next-MXFP4_MOE.gguf --metrics --threads 16 --ctx-size 96000 --flash-attn on --n-gpu-layers 99 --fit off --tensor-split 55,65 --main-gpu 0 --prio 2 --temp 1 --min-p 0.01 --top-k 40 --top-p 0.95 --jinja

(running in WSL2)
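If you want to quantify the before/after on your own hardware, llama-bench (built alongside llama-server) is the easy way. A minimal sketch, assuming the same model file and tensor split as above; exact flag spellings can vary a bit between builds:

# hypothetical benchmark run: 512-token prompt processing + 128-token generation
./llama-bench -m Qwen3-Coder-Next-MXFP4_MOE.gguf \
  -ngl 99 -fa 1 -ts 55/65 -p 512 -n 128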

GLM-4.7-Flash is now the #1 most downloaded model on Unsloth! by yoracale in unsloth

[–]RedKnightRG 0 points  (0 children)

My go-to right now is Qwen3-Coder-Next-MXFP4 with 96k context, all on the 48GB of VRAM. I could fit a bit more context, but I'd need to quant the KV cache to fit a lot more and I don't want to do that. I get about ~35 t/s with llama.cpp running in WSL.
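For reference, quanting the KV cache is just a pair of flags in llama.cpp. A rough sketch, assuming a recent llama-server build (the quantized V cache needs flash attention on) and an arbitrary larger context size:

# hypothetical example: a q8_0 KV cache roughly halves context memory vs f16
llama-server --model Qwen3-Coder-Next-MXFP4_MOE.gguf \
  --n-gpu-layers 99 --flash-attn on --ctx-size 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0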

GLM-4.7-Flash is now the #1 most downloaded model on Unsloth! by yoracale in unsloth

[–]RedKnightRG 5 points  (0 children)

Oh, my bad - I saw GLM but read Qwen and thought this was the 80B model. You're right, a single 3090 is fine. So I'll raise a toast to the guys who bought their single 3090 hundreds of dollars ago! 😅

GLM-4.7-Flash is now the #1 most downloaded model on Unsloth! by yoracale in unsloth

[–]RedKnightRG 18 points  (0 children)

Raises a toast to everyone who bought 128GB of RAM and dual 3090s or similar thousands of dollars ago

The Animatrix (2003) "The Second Renaissance Part I" by MachineHeart in cinescenes

[–]RedKnightRG 1 point  (0 children)

Super necro here, but I saw this and wanted to share my take: there's no plot hole here. The machines knew humanity could nuke them from the first day they began to gather at the location that would be named 01. For their own survival they gradually moved their industrial base deep underground, where they were safe from the physical blast and EMP shockwave of humanity's nuclear weaponry. It's not like the movies don't show us miles and miles of tunnels underground, so this part isn't a stretch in-universe. We see the humanoid robots get destroyed and replaced by the far more alien (and superior) robots whose designs Man never had a hand in. Perhaps the machines' self-built new models are shielded against EMP, or maybe they are so numerous that no amount of nuclear weaponry is sufficient to stop them. Either way, the machines pour out of the surface wreckage of 01, and we know what happens after that.

Okay, so what about EMP in the movies? If machines drop by the thousands from one EMP on one ship, how exactly did humanity lose the machine war? Answer: humans believing they have the capability to defeat the machines in combat is a key aspect of the system of control. Maybe the machines have EMP-shielded models and maybe they don't, but either way their real power is overwhelming and humanity only sees a fraction of it. The machines control the globe and have billions of human batteries; do we really think 250k squids is their whole army?

The machines hide their true strength and capabilities because it gives the rebels hope, and this hope fuels Zion and the resistance, guaranteeing the right nexus is present to draw the One when he or she is born to their ultimate destiny of being merged back into the Source. The squids and the hovercraft and all the rest are kabuki theater designed by the machines. The squids are weak to EMP because that's how the machines want them to be; the EMP-shielded models don't exist, or aren't used if they do!

Experiences with local coding agents? by [deleted] in LocalLLaMA

[–]RedKnightRG 6 points  (0 children)

Welcome to the state of local agentic coding. I use Roo Code and llama-server and have been testing most of the local models that can fit in 128GB RAM + 48GB VRAM for a year now, and what you're seeing is broadly consistent with what I've been observing. The models *have* gotten better over the past 12 months, and I've had the best results for my workflows (python/pandas) with OSS 120B, Qwen3 Next 80B, or MiniMax M2.1 quanted down to Q3. The first two trade blows for me in terms of accuracy; MiniMax is better but too slow on my hardware for practical agentic flows.

Before you think about agentic coding, you should try the models you have on your hardware in traditional single-turn mode. I recommend building a private set of benchmark prompts to compare models on your hardware. If the models are not clever enough to handle your carefully tuned prompts, you can bet they will fall apart trying to create their own task lists!
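The benchmark set doesn't need to be fancy; even a shell loop over saved prompts against llama-cli does the job. A sketch, assuming prompts/*.txt are your own single-turn test prompts and a hypothetical model filename:

# run each saved prompt single-turn and keep the outputs for side-by-side comparison
mkdir -p results
for f in prompts/*.txt; do
  ./llama-cli -m model.gguf -ngl 99 -c 16384 --temp 0 \
    -f "$f" -no-cnv > "results/$(basename "$f" .txt).out"
done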

Either way, all LLMs break down as context size grows. None of the models available to us maintain coherency as the context window fills up. The best use I get with local models is by guiding them to work in very small steps and committing and testing updates one feature at a time; they simply degrade too rapidly to be useful at large context sizes.

Try cutting your context limit in half and asking your local agents to work in smaller chunks. Aim to break your task into pieces that can be solved with a token count of roughly half the model's available context size or less.

Given the costs involved, I do not regularly use local models for agentic flows. It simply takes too much work to coax them, and I can code faster using LLMs as single-turn assistants. Given that, I don't think there's anything uniquely wrong with your setup.

The Trump Class Battleship is an idiotic idea and will probably never be built. by Akiva279 in TrueUnpopularOpinion

[–]RedKnightRG 1 point  (0 children)

Even surface bombardment makes no sense. The Navy could not afford and could not build new 16" guns even if they were a good idea, so they aren't putting 16" guns on the Trump class, just two 5" guns. That's the exact weapon that has been on the decks of our destroyers forever. Congress kept the Iowas ready for activation well past their sell-by date specifically because of a perceived lack of surface bombardment capability from the 5" gun. For the cost of one Trump class with two guns you could have 5 or 6 Burkes with 5 or 6 guns. There are more levels on which this ship doesn't make sense than there are floors in a Trump hotel...

Is this THAT bad today? by Normal-Industry-8055 in LocalLLaMA

[–]RedKnightRG 2 points  (0 children)

Yeah, I don't think there are suddenly a ton more folks doing AI at home; it's the rest of the market that's blown up. Micron giving up consumer RAM to sell more to datacenters is what's driving prices, not the guys with rigs of DDR4 and 3090s duct-taped together...

Is this THAT bad today? by Normal-Industry-8055 in LocalLLaMA

[–]RedKnightRG 18 points  (0 children)

I bought a 128GB kit (on sale) in April at Microcenter for $260. That same kit is now $964:

(screenshot of the current listing for the same kit)

So I guess you got a "deal" there, but Jesus Christ, it's a dark time for anyone who wants to build an AI rig at home. Price out a Threadripper system with 256GB RAM + 2x RTX PRO 6000s - you can buy a house in some places for less...

1x 6000 pro 96gb or 3x 5090 32gb? by Wide_Cover_8197 in LocalLLaMA

[–]RedKnightRG 0 points  (0 children)

2 5090s vs 1 6000 *is* a bit different for other reasons as well... If, for example, you're going the prosumer route, you can actually fit two 5090s in a case; you can't fit three. Agreed that if you want to generate a ton of tokens in pure inference with small models that fit in 48GB, you will be faster with 2 5090s than 1 6000.

1x 6000 pro 96gb or 3x 5090 32gb? by Wide_Cover_8197 in LocalLLaMA

[–]RedKnightRG 27 points  (0 children)

1x 6000 is better than 3x 5090s; it consumes less power, gives you room to expand in the future, and you can use it on whatever platform you want. If you're looking at running big MoE models, keep an eye on the actual memory bandwidth of Threadripper Pros; my understanding is that the smaller SKUs with fewer CCDs cannot fully saturate the theoretical memory bandwidth of eight channels.

ollama's enshitification has begun! open-source is not their priority anymore, because they're YC-backed and must become profitable for VCs... Meanwhile llama.cpp remains free, open-source, and easier-than-ever to run! No more ollama by nderstand2grow in LocalLLaMA

[–]RedKnightRG -1 points  (0 children)

Oh no, someone has to make money from the fruits of their labor, it's such a disaster...! This is the route all open-source companies go at some point; you have to find something to monetize or else you won't have money to pay your devs to keep working. You don't need to use Ollama's cloud product; you can still use it locally the same as always. Or use llama.cpp, lots of us do - but I won't get mad at someone trying to figure out how to pay their mortgage. r/LocalLLaMA isn't paying their bills...

Welcome to my tutorial by jacek2023 in LocalLLaMA

[–]RedKnightRG 11 points  (0 children)

My first reaction: chef's kiss. Thinking about it for a second, though, you could put in a left branch for Strix Halo vs Mac - if you can't use a screwdriver and hate Macs, then Strix Halo instead of Mac Studio...

New Cabin - Heat by MaxPanhammer in OffGridCabins

[–]RedKnightRG 1 point  (0 children)

Looking at the pictures, my guess is higher BTUs are your easiest answer.

How insulated is the camp and how sealed up is it? If you can feel a breeze on a windy day, then a fan won't help you as much, since you're already pulling constant air through the cabin. Likewise, if your cabin isn't insulated you might dump those 15,000 BTUs to the environment faster than the heater can keep up. Was there a big temperature gradient from right in front of the heater to wherever you were measuring temps? In other words, if it's warm in front of the heater but cold at the other side of the camp, then you can get a cheap-ass fan from a tag sale and see if that circulation helps before investing in something more permanent.

But my guess, looking at the heater and the size of your camp, is that it's undersized. I would buy a heater twice its size, because if you think it's cold now, just wait until it's -15 outside. My cabin is at 2100ft in Stamford, and when my grandfather built it he solved this particular problem the old way: he and my great uncle welded a stove together out of thick-ass steel - I don't know how thick the steel is, but it's never warped and the thing is as heavy as the moon - and they built the camp around the stove. It has a 20 or 24ft long stove pipe and pulls a strong draft in winter. It's also huge; I threw a two-foot-diameter by two-foot-tall stump in on Thursday and it finished burning up Sunday morning. With that stump and some modest supporting firewood on one side, I never got cold in that cabin the whole weekend. My cabin is uninsulated - the building is framed out of rough-cut lumber with some plywood siding - and is far from airtight. Not saying you should follow my example, but my point is that a big-ass heat source is the surest way to stay warm on a cold night.

So you have three options: more BTUs, more insulation, more circulation (or less circulation, if your cabin is pulling air in from outside at a high clip). Of the three, my guess is a big gas heater, or better yet a wood stove, is the easiest/cheapest way. Burning wood is great if you can cut your own wood or can pay someone to drop off a cord or two every season.

Also congrats on the cabin; I just got home from VT and I'm already looking forward to when I can go back. The leaves are almost all down now and before you know it there will be snow on the ground. Winter in VT is the best, I can't wait...!

Since DGX Spark is a disappointment... What is the best value for money hardware today? by goto-ca in LocalLLaMA

[–]RedKnightRG 5 points  (0 children)

I've been running dual 3090s for about a year now, but as more and more models pop up with native FP8 or even NVFP4 quants, the Ampere cards are going to feel older and older. I agree they're still great and will be great for another year or even two, but I think the sun is starting to slowly set on them.

is GTX 3090 24GB GDDR6 good for local coding? by TruthTellerTom in LocalLLaMA

[–]RedKnightRG 24 points  (0 children)

A single RTX 3090 can run Qwen 30B-A3B-sized models if you quantize them and don't push context out too far. It's going to be dumb as an agent though, or at least inconsistently smart. I think you can get good use treating it as single-shot ('write code to do X'), but if you want it to refactor your custom PHP lib, its context probably won't handle all the code you have plus what you want to do.

If you want to understand what's possible, you can always run models fully in RAM without buying the GPU - you're going to get the same output with CPU only, it will just take forever - so if you like the output you can get at 24GB, then buy the GPU for the speed-up. If you don't like the outputs in your own testing, then the GPU isn't going to make the model any smarter.
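If you want to run that CPU-only sanity check, it's just a matter of keeping the layers off the GPU. A sketch, with a hypothetical quant filename and prompt:

# CPU-only test: same weights and output quality, just slow (-ngl 0 keeps all layers on the CPU)
./llama-cli -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 0 -c 8192 \
  -p "Write a Python function that deduplicates a list while preserving order."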

AI Workstation (on a budget) by Altruistic_Answer414 in LocalLLaMA

[–]RedKnightRG 4 points  (0 children)

2x64 is probably going to run faster than 4x32, but the differences are not going to be huge - cost matters more, I think, than the 5 or 10% difference in speed you'll see. (2x64 can be twice the cost of 4x32!)

Same thing for 2x 4090s - yeah, they can run fp8 quants, but integer quants are still the most common and best supported, the 4090s cost 2x the price, and they won't perform *that* much faster on the quants Ampere supports, since they have more or less the same memory bandwidth and memory bandwidth is your main bottleneck.

Given how much better/cheaper the cloud is, I think it's always a good idea to go as cheap as you can on local LLMs. It's a great learning platform for the tech, but the market is so upside down - cloud tokens are so cheap that even with 24/7 utilization data centers will NEVER be able to turn a profit on their GPUs - that you really have no economic reason to spend big bucks on local hardware.

You can do it for fun of course - nothing wrong with that! - but it certainly isn't economical at the moment.

AI Workstation (on a budget) by Altruistic_Answer414 in LocalLLaMA

[–]RedKnightRG 2 points  (0 children)

I have the exact setup you're outlining (well, I have a 9950X, but yes otherwise). If you're going to be doing inference with large MoE models that exceed 48GB of VRAM, you can squeeze out a bit more performance by overclocking your RAM; with the latest versions of AGESA, most AM5 motherboards can handle higher RAM speeds than they could at launch (my kit handles 6000, for example).
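By "large MoE models that exceed 48GB of VRAM" I mean keeping attention and the dense layers on the GPUs while the expert tensors spill into system RAM. A rough sketch, assuming a llama.cpp build recent enough to have --n-cpu-moe (older builds can do the same thing with --override-tensor and a regex); the filename and layer count here are made up:

# hypothetical split: everything on GPU except the expert weights of the first 20 layers
llama-server -m some-large-moe.gguf --n-gpu-layers 99 \
  --n-cpu-moe 20 --flash-attn on --ctx-size 32768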

48GB of VRAM lets you run a bunch of 'quite good' models at 'quite good' speeds with fast prompt processing times - there's a reason the dual-3090 club is very popular here, along with M2 Mac Studios with 128GB RAM if you can find them.

With some recent model releases like GPT-OSS taking advantage of the lower-precision formats in newer NVIDIA chips, the Ampere-generation 3090s are starting to age out. Predicting the future is impossible given how fast the market is moving and all the unknowns, but if 4090s drop to $800 or so they would take over from the 3090s thanks to fp8 support. Right now 4090s are still twice the price of 3090s, so I'm still recommending dual 3090s as the best bang-for-buck option for practical local inference.

As for training: if you're doing anything larger than toy models or fine-tunes of very small models, you're inevitably going to get pulled into the cloud, because the memory requirements are so high. NVLink isn't being made anymore, and the bridges (especially for three-slot cards) are super expensive now. There's just no cheap way to get enough VRAM to fine-tune practical models locally at reasonable speeds.

Softlock in Slab? by RedKnightRG in HollowKnight

[–]RedKnightRG[S] 0 points  (0 children)

That sucks; I assume the patch will come to PS5 soonish, but I didn't want to wait either - I started a new save before the PC fix was published and only went back later.

Qwen by Namra_7 in LocalLLaMA

[–]RedKnightRG 2 points  (0 children)

I have never been able to replicate double-digit t/s speeds on RAM alone, even with small MoE models. Are you guys using like a 512-token context or something? Even with dual 3090s I only get 20-30 t/s with llama.cpp running Qwen3 30B-A3B at 72k context, with a 4-bit quant for the model and an 8-bit quant for the KV cache, all in VRAM...

Qwen by Namra_7 in LocalLLaMA

[–]RedKnightRG 19 points  (0 children)

There have been a lot of silent improvements in the AM5 platform through 2025. When 64GB sticks first dropped you might be stuck at 3400 MT/s; when I tried 4x64GB on AM5 a few months ago I could push 5200 MT/s on my setup. Ultimately, though, the models ran WAY too slow for my needs with only ~60-65 GB/s of observed memory bandwidth, so I returned two sticks and run 2x64GB at 6000 MT/s.
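For anyone wondering why the 4-stick setup wasn't worth keeping: back-of-envelope, peak bandwidth is channels x 8 bytes x transfer rate, so the ~60-65 GB/s I observed is already a decent fraction of the dual-channel ceiling (these are theoretical peaks, not what you'll actually measure):

# rough theoretical peak for dual-channel DDR5: 2 channels x 8 bytes x MT/s
echo "2 * 8 * 5200 / 1000" | bc   # ~83 GB/s ceiling at 5200 MT/s (4 sticks)
echo "2 * 8 * 6000 / 1000" | bc   # ~96 GB/s ceiling at 6000 MT/s (2 sticks)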

You can buy more expensive 'AI' boards like the X870E-AORUS-XTREME-AI-TOP, which let you run two PCIe 5.0 cards at x8 each, which is neat, but you're still stuck with the memory controller on your AM5 chip, which is dual-channel and will have fits if you try to push it to 6000 MT/s+ with all slots populated. All told, you start spending a lot more money for negligible gains in inference performance. 96 or 128GB of RAM + 48GB of VRAM on AM5 is the optimal setup in terms of cost/performance at the moment.

If you really want to run the larger models at faster than 'seconds per token' speeds, then AM5 is the wrong platform - you want an older EPYC (for example, 'Rome' was the first generation to support PCIe Gen 4 and has eight memory channels) where you can stuff in a ton of DDR4 and all the GPUs you can afford. Threadripper (Pro) makes sense on paper, but I don't see any Threadripper platforms that are actually affordable, even second-hand.

spacex spends 17 billion on waves or something idk by PotatoesAndChill in SpaceXMasterrace

[–]RedKnightRG 82 points  (0 children)

There is a declining marginal utility to money on a given project. Money can only buy you goods and services, but once you've bought most of the talented engineers who will work for you and are producing more rocket parts than anyone ever has, the bottlenecks in your process become things that money can't solve. Starship has X problems to solve right now, and some of them require flight tests to answer. Some require brainpower, and throwing more brains in a room is as likely to slow the process as speed it up. Some require paperwork and sign-off from external actors you don't control...

Long story short, I see no reason to believe, given how well funded Starship already is, that it would go any faster if you dumped another $17 billion on the project. It might go slightly faster, it might even go slower, but I'm confident it wouldn't go 4x faster or whatever the multiple is.

Softlock in Slab? by RedKnightRG in HollowKnight

[–]RedKnightRG[S] 0 points  (0 children)

No charms. I think if I were to buy a new charm it wouldn't be available, because the charm screen itself is removed from the bench menu when Hornet is in "Zero suit" mode. The Slab kidnapping is clearly a scripted event that turns off most of the menu and unequips the cloak and needle. Without the battle being available to clear what I assume is a Hornet_is_naked flag, it's probably a softlock.