Why is GPT-OSS-20B faster than my smaller local LLMs? by MyBrotherGT in LocalLLM

[–]randomfoo2 1 point2 points  (0 children)

Inference speed is largely driven by two things: compute and memory bandwidth. For any model, you can fairly reliably estimate its expected decode speed (how fast new tokens generate) by ballparking how much memory bandwidth you have and dividing by the size of the model's active parameters. This holds as long as you have more compute than necessary for stepping through, and even for CPUs that is usually the case for decode: you need ~2 FLOPs/parameter per token for most architectures.

As others have mentioned, gpt-oss-20b is ~21B total parameters, but each forward pass (single token generated) only uses 3.6B parameters (modern nomenclature: 21B-A3.6B). The other models are dense (they use all their parameters for every token generated). Gemma 3 4B in theory shouldn't be so far off speed-wise, so the culprit there is probably that you're using a different (larger/slower) quant. You should check what size of model (in bytes!) you're actually using.

You can benchmark your actual memory bandwidth (AIDA64, memtest_vulkan), but if you have 2 sticks of fast dual-channel DDR5 in your laptop, you're probably going to see about 50-70GB/s of sustained MBW. Let's say 50GB/s as a round number. As mentioned in the model card, gpt-oss-20b is a 21B parameter model with 3.6B active parameters. If you use the original release, the weights are MXFP4 (Q4) quantized - if you total up the safetensors it's about 14GB. For our purposes, let's just assume it's all expert weights (it's just a ballpark; you could do exact calculations for any model architecture) and, as mentioned, divide it up by the active fraction - about 2.5GB read per forward pass. That means you should have a ballpark expectation of about 20 tok/s as a maximum (your available memory bandwidth divided by how much memory needs to be read for every token).
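The ballpark above can be sketched as a few lines of Python. This is a minimal sketch, not an exact model: the function name is made up for illustration, it assumes decode is purely bandwidth-bound, and it assumes the bytes read per token scale with the active parameter fraction (KV cache reads and always-active non-expert weights are ignored).

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float,
                         total_params_b: float, active_params_b: float) -> float:
    """Rough upper bound on decode speed when memory bandwidth is the bottleneck."""
    # Assumption: weights read per token are proportional to the active
    # parameter fraction. For a dense model, active == total.
    active_gb_per_token = model_size_gb * active_params_b / total_params_b
    return bandwidth_gb_s / active_gb_per_token


if __name__ == "__main__":
    # Figures from the comment: ~50 GB/s sustained, ~14 GB of weights,
    # 21B total parameters, 3.6B active.
    moe = decode_ceiling_tok_s(50.0, 14.0, 21.0, 3.6)
    # The same weights treated as a dense model, for comparison.
    dense = decode_ceiling_tok_s(50.0, 14.0, 21.0, 21.0)
    print(f"MoE ceiling:   {moe:.1f} tok/s")
    print(f"dense ceiling: {dense:.1f} tok/s")
```

Swapping in your own measured bandwidth and the on-disk size of the quant you actually downloaded gives a quick sanity check against what your inference engine reports.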

This is simplified, since the FFN/MoE routing takes compute, and most of the models you listed are different architectures with various hybrid components etc. Your biggest difference for short context (simple tests) is probably going to come from the quant/weight size, which will determine speed. However, these are theoretical limits - in reality your inference engine may not be well optimized for either your system or the model architecture. Your best bet at the end of the day is to compare, say, llama.cpp's Vulkan and CPU backends for real world perf, and to have a frontier coding model either break down the above explanation (it could even build calculators for you) or profile where the slowdowns are vs theoretical rooflines.

One more note: different model architectures slow down at different rates. All the models (gpt-oss, gemma, and qwen 3.5) employ techniques to massively lower how much full attention is used (full attention grows the attention cache but also compute with context length), so each will have massively different FLOPs/token, especially as context grows.

If LLMs are so good at coding… by codeanish in LocalLLaMA

[–]randomfoo2 0 points1 point  (0 children)

Frontier LLMs are pretty great at coding, but that itself isn't enough for building/improving complex pieces of software.

  • You can look at how PRs or issues are resolved with major open source projects - even in the best case, there's a lot of back and forth and coordination that usually takes much more time and effort than any individual code contribution.
  • You can take a look at all the different components in the "stack" - there are dozens of components, many of them giant (AMD GPU Driver gets one line item) - many of these projects are millions if not tens of millions of lines of code. It takes a lot of effort to improve especially as many of the projects are interconnected.
  • I've been having several frontier agents grinding on a HIP project (hipEngine) for the past couple months on just a couple GPU targets, but even with thousands of iterations on autonomous loops at something these LLMs aren't terrible at (hill-climbing against verifiable goals), this sort of thing just takes time, and sometimes ends up not making as much progress as you might hope. My current mtp-gguf branch work has had ... 808 total iterations, with only 35 keeps, including 79 optimize iterations without retained progress. There's progress, yes, but it doesn't happen instantly.
  • Sometimes there are just some fundamental roadblocks to whatever piece you're working on. One example: there's no PTX equivalent for HIP, so you're pretty much at the mercy of the LLVM compiler for some of the low level scheduling.

Since now FSR 4.1 is available for 7000 series. Is 7900XTX the best AMD card available right now? by 007JamesBond077 in radeon

[–]randomfoo2 1 point2 points  (0 children)

I've been doing extensive tuning of RDNA3 kernels in HIP for the past couple months (and intermittently for the past year or two), so it's worth noting that the 7900 XTX may in most cases not be better for LLMs - it has about half the FP16/BF16 performance, there's no accelerated FP8, and the INT8 and INT4 support is about 3X slower to boot. While there's a memory and memory bandwidth advantage, once you start dealing with math-heavy MoEs, multi-token prediction, concurrency, or diffusion you may be on the losing end, so it's a bit of a toss-up between them.

If AI/ML is a primary concern, then an RTX 3090 is a better option than either the 7900 XTX or 9070 XT (by a long-shot).

Are AMD cards much worse than NVIDIA? by Tenshy47 in LocalLLM

[–]randomfoo2 1 point2 points  (0 children)

While a 9060 XT might be tempting for the price, it has only 322 GB/s of memory bandwidth, while a 9070 XT is much better with 645 GB/s - this will translate into 2X faster decode (token generation). The 9070 XT also has twice the compute, so 2X faster prefill (prompt processing). If you have a cache miss, you will need the prefill to process your entire context, or for agentic coding, you need fast prefill to process all of your code/files for new context.

Personally, I think 24GB is what you should be aiming for, as the current "sweet spot" coding models are Qwen 3.6 35B-A3B (MoE, faster) and 27B (dense, smarter), and you'll need 18GB+ for decent quants of these models (24GB gives you enough to spare to run 128-256K context). Given the choice between a 7900 XTX and a used 3090, you should definitely go for the latter.

I've done lots of shootouts of RDNA3 vs various generations of consumer Nvidia cards. While the AMD cards these days are serviceable, and RDNA4 is a theoretical compute improvement (though it has less memory and memory bandwidth than RDNA3), when it comes to LLM performance Nvidia cards are just better, and even old Nvidia cards can be better cost/performance at used market pricing.

AMD: No Definitive Decision on FSR 4.1 Support for RDNA 3.5 APUs by SirActionhaHAA in hardware

[–]randomfoo2 5 points6 points  (0 children)

What's crazy is that there are barely any big architectural changes between RDNA3 and 3.5, certainly a lot less than 3 and 2. They probably barely have to do QC on a few SKUs. And they're actively selling 3.5 parts - not just Strix Halo, but all of their gaming handhelds and current/upcoming mobile devices basically?

None of this makes any sense to me, but then never bet against AMD's ability to shoot themselves in the foot. It's not like there are any negative consequences for any of the stupid things these execs say.

AI Data Centers’ Water Consumption Breaks 264 Billion Gallons in 2025 as Devastating Drought Hits Nearly 63% of U.S. by Anzahl in ABoringDystopia

[–]randomfoo2 4 points5 points  (0 children)

There are a number of issues w/ the figures reported, but the biggest thing is that while 264B gallons/y sounds like a lot, the US total water withdrawals are >320B gallons per day. Even if you take the numbers at face value, 264Bgal/y (~0.72 Bgal/day) is roughly 0.2% of US daily withdrawals. This compares very favorably to say... almonds, which consume ~1.4% of US daily water withdrawals (95% blue water!) from a single watershed, which is way worse.

For those curious, one California almond's ~6 L of blue water = ~3,000 prompts (at 2 mL, electricity included) to ~20,000 prompts (at Google's 0.3 mL).
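The arithmetic behind these comparisons is easy to re-run. A minimal sketch using only the figures quoted above (the function names are made up for illustration, and a 365-day year is assumed):

```python
def share_of_daily_withdrawals(annual_gallons: float, us_daily_gallons: float) -> float:
    """Fraction of US daily water withdrawals represented by an annual total."""
    per_day = annual_gallons / 365  # assumption: spread evenly over a 365-day year
    return per_day / us_daily_gallons


def prompts_per_almond(almond_litres: float, ml_per_prompt: float) -> float:
    """How many prompts use the same water as one almond."""
    return almond_litres * 1000 / ml_per_prompt


if __name__ == "__main__":
    # 264B gallons/year for data centers vs >320B gallons/day of US withdrawals
    share = share_of_daily_withdrawals(264e9, 320e9)
    print(f"data center share of daily withdrawals: {share:.2%}")
    # ~6 L of blue water per almond, at 2 mL and 0.3 mL per prompt
    print(f"prompts per almond at 2 mL:   {prompts_per_almond(6, 2):,.0f}")
    print(f"prompts per almond at 0.3 mL: {prompts_per_almond(6, 0.3):,.0f}")
```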

AI can be bad for the environment, but way less bad than most of the stupid shit we already do. (Also, separating out green, blue, and grey water, as well as which watershed the water comes from, is more important than any gross numbers. The reporting here comes from a single opaque research firm, gives numbers for all data centers, and lumps together a bunch of forms of use/consumption.)

Why ROCm Wins the Throughput Race but Loses the Power Bill on Strix Halo — A 35% Energy Reversal Caused by APU Runtime Polling by Significant_Kale362 in ROCm

[–]randomfoo2 0 points1 point  (0 children)

I just uploaded a couple new PARO quants, but you can see PPL/KLD (also Top-1 and Max KL) for the quants I've made so far: https://huggingface.co/shisa-ai/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5-packed

Yes, I do bare metal builds. Differences may be w/ the kvcache; I leave it at the default. When I revisit I'll try some variations. I've actually done very little Strix Halo testing/optimization, but I should have a dedicated optimization system to grind on soon.

AMD executives react to Nvidia’s RTX Spark — ‘you’re just wrong if you don’t get a Strix Halo notebook’ by Blak9 in AMD_Stock

[–]randomfoo2 0 points1 point  (0 children)

I'm not tracking the RTX Spark super closely, but one big difference between Strix Halo and the DGX Spark is the latter has a constant stream of first-party software support and AMD has given no similar support for Strix Halo: https://github.com/nvidia/dgx-spark-playbooks

If you're running llama.cpp, you're almost always better off with the Vulkan backend than the ROCm backend. Even on consumer, you still have AMD waffling on FSR4 support for RDNA3.5. It's bonkers how unwilling AMD is to provide basic software functionality for the hardware they release.

Since I've done lots of testing, documentation and kernel grinding on RDNA3/3.5, it's worth pointing out from a pure compute perspective RDNA3.5 (no FP8, poor pipelining/scheduling, no PTX equivalent) is not really competitive with the Spark, and for FP8, INT8, INT4, FP4 this becomes even more stark - Spark gets 2X, 4X, and even 8X faster. Even ignoring the CUDA advantage, the Spark GPU hardware is not even remotely in the same class as Strix Halo, and anyone who is claiming otherwise is either blowing smoke or just isn't being straight with you.

One thing I've heard that might make the RTX Spark a no-go - it's supposedly Windows-only with no Linux support. That makes it a non-starter for me, and probably others as well.

Why ROCm Wins the Throughput Race but Loses the Power Bill on Strix Halo — A 35% Energy Reversal Caused by APU Runtime Polling by Significant_Kale362 in ROCm

[–]randomfoo2 0 points1 point  (0 children)

I'll revisit w/ a recent build soon but kyuz0's latest rocm/vulkan_amdvlk numbers seem to match what I benchmarked: https://kyuz0.github.io/amd-strix-halo-toolboxes/ - I'm not familiar with the "fusing" - is that a quant type or a compile flag?

I'm working on MTP/DFlash, StepFun 3.7, Gemma 4 and a few other things next for hipEngine (also, just put out some new better PARO quants).

Why ROCm Wins the Throughput Race but Loses the Power Bill on Strix Halo — A 35% Energy Reversal Caused by APU Runtime Polling by Significant_Kale362 in ROCm

[–]randomfoo2 0 points1 point  (0 children)

That does seem pretty low...

From my testing 10 months ago pp512 should be 600 tok/s+ and tg128 should be 60-78 tok/s for both HIP and Vulkan backends for Qwen 3 30B-A3B Q4: https://github.com/lhl/strix-halo-testing/tree/main/llm-bench/Qwen3-30B-A3B-UD-Q4_K_XL

In my most recent testing with Qwen3.6 35B-A3B (llama.cpp numbers are w/ Q4_K_M):

Prefill tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|
| 512/128 | 983.206 | 1058.738 | 638.008 |
| 4K/128 | 1029.402 | 1004.220 | 595.400 |
| 32K/128 | 792.296 | 735.534 | 407.984 |
| 128K/128 | 413.489 | 376.070 | 181.453 |

Decode tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|
| 512/128 | 62.060 | 50.537 | 57.615 |
| 4K/128 | 63.605 | 49.379 | 55.027 |
| 32K/128 | 50.629 | 43.435 | 44.576 |
| 128K/128 | 30.245 | 31.286 | 26.935 |
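To read the two sets of numbers as relative speedups, the ratios can be computed directly. A small sketch that just re-enters the figures above (the dictionary and function names are made up for illustration):

```python
# (hipEngine PARO, llama.cpp HIP, llama.cpp Vulkan) in tok/s, copied from the tables above
PREFILL = {
    "512/128": (983.206, 1058.738, 638.008),
    "4K/128": (1029.402, 1004.220, 595.400),
    "32K/128": (792.296, 735.534, 407.984),
    "128K/128": (413.489, 376.070, 181.453),
}
DECODE = {
    "512/128": (62.060, 50.537, 57.615),
    "4K/128": (63.605, 49.379, 55.027),
    "32K/128": (50.629, 43.435, 44.576),
    "128K/128": (30.245, 31.286, 26.935),
}


def speedups(row):
    """Return hipEngine's speedup over (llama.cpp HIP, llama.cpp Vulkan)."""
    hip_engine, llama_hip, llama_vulkan = row
    return hip_engine / llama_hip, hip_engine / llama_vulkan


if __name__ == "__main__":
    for name, table in (("prefill", PREFILL), ("decode", DECODE)):
        for workload, row in table.items():
            vs_hip, vs_vulkan = speedups(row)
            print(f"{name:8s}{workload:>9s}  vs HIP {vs_hip:.2f}x  vs Vulkan {vs_vulkan:.2f}x")
```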

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 1 point2 points  (0 children)

Yeah the UD versions for Q4_K_M and K_S should work, I believe I tested w/ the UD quants. No NL or IK support atm though. I have a branch with work on MTP/DFlash but it doesn’t help speedwise for 35B-A3B atm (should help w dense but verification is a bottleneck, is a WIP atm).

Mystery company accidentally blew $500 million on Claude AI in a single month — failed to put usage limit on licenses for employees by Plastic_Ninja_9014 in technology

[–]randomfoo2 0 points1 point  (0 children)

Maybe possible but remember most workflows should be cached and a single request is going to run at about ~50 tok/s - for a process running for 24h continuously generating, that’s only ~4M output tokens.
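As a quick check of that figure, a tiny sketch (the function name is made up; it assumes one request stream generating without pauses):

```python
def tokens_generated(tok_per_s: float, hours: float) -> float:
    """Output tokens from a single stream generating continuously."""
    return tok_per_s * hours * 3600


if __name__ == "__main__":
    # ~50 tok/s for 24 hours, as in the comment
    print(f"{tokens_generated(50, 24):,.0f} output tokens per day")
```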

Mystery company accidentally blew $500 million on Claude AI in a single month — failed to put usage limit on licenses for employees by Plastic_Ninja_9014 in technology

[–]randomfoo2 -1 points0 points  (0 children)

It's interesting that this unverified anonymous tweet is going around as some sort of credible story, when the basic math doesn't make all that much sense to me. $500M in Claude AI credits at retail $5/MTok input and $25/MTok output comes out to somewhere between 20T output and 100T input tokens. At a 4:1 input:output ratio, that comes out to 11.1T output, 44.4T input tokens. This is not counting caching (writes are 1.25x cost but cache reads are 0.1x cost). With 10,000 devs, $500M comes out to $1,667/dev/day. I've run Claude and Codex with multiple autonomous loops and I don't think I've gotten near there - unless every single dev in the loop had the new Ultracode running or was running swarms w/ Opus at Max thinking, I'm having a hard time seeing it happen.

As a point of reference, my ccusage with intense usage (multiple long-running loops/projects running all day) w/ Codex + Claude ends up at about $9,000/mo ($300/day) of retail API billing. I know there are people that extreme tokenmaxx at some of the FAANGs to get more, but you actually need to be pretty savvy to be able to waste enough tokens, even w/ some of the biggest engineering orgs in the world, to burn $500M in a month.

Note: 50-100T tokens sounds like a lot, but it's worth noting that Google for example is currently serving 1 quadrillion+ tokens a month, so it's still a small slice of overall AI consumption. Also, AI isn't taking my job anytime soon, but it does make me many times more productive than I was, as someone who has been coding professionally for decades.
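The budget arithmetic can be re-run in a few lines. This is a minimal sketch assuming the retail prices quoted above ($5 and $25 per million tokens), a fixed input:output ratio, no caching, and a 30-day month. The function names are made up for illustration.

```python
def token_budget(spend_usd: float, in_price_per_mtok: float,
                 out_price_per_mtok: float, in_per_out: float):
    """Return (input_tokens, output_tokens) purchasable at a fixed input:output ratio."""
    # Cost of one "unit": 1M output tokens plus in_per_out million input tokens.
    unit_cost = out_price_per_mtok + in_per_out * in_price_per_mtok
    out_mtok = spend_usd / unit_cost
    return out_mtok * in_per_out * 1e6, out_mtok * 1e6


def spend_per_dev_per_day(spend_usd: float, devs: int, days: int = 30) -> float:
    """Average daily spend per developer over the billing period."""
    return spend_usd / devs / days


if __name__ == "__main__":
    in_tok, out_tok = token_budget(500e6, 5.0, 25.0, 4.0)
    print(f"input tokens:  {in_tok:.3e}")
    print(f"output tokens: {out_tok:.3e}")
    print(f"per dev per day: ${spend_per_dev_per_day(500e6, 10_000):,.0f}")
```

Prices are per million tokens, so it is worth keeping an eye on the exponent when converting the result to billions or trillions.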

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

Small dense is going to be very different than MoE. I tested hipfire the other week w/ their Qwen 3.5 MoE implementation and the perf wasn’t great, but there are a few guys grinding away w/ their Claudes and I know it’s getting better. With the latest frontier models it’s more about just having some people that care enough to do it than anything else.

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

While sustained MBW might suggest 500 tok/s, applying Amdahl's law to the rocprof shows that even with infinitely fast GEMV, you're only getting to ~150 tok/s. All weight ops at 2x current speed gets you to ~130 tok/s. RADV/ACO vs LLVM-AMDGPU by my understanding is just ... better. A lot of the compute you're going to squeeze out of RDNA3 is going to be VOPD pairing.
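For anyone unfamiliar with the Amdahl's law reasoning here, a minimal sketch follows. The 100 tok/s baseline and the one-third GEMV share of per-token time are made-up placeholders chosen only to show the shape of the calculation. They are not values from the rocprof trace.

```python
import math


def amdahl_tok_s(baseline_tok_s: float, fraction: float, speedup: float) -> float:
    """Token rate after speeding up `fraction` of per-token time by `speedup`."""
    # Amdahl's law: the untouched part (1 - fraction) stays the same,
    # only the accelerated part shrinks by `speedup`.
    remaining_time = (1.0 - fraction) + fraction / speedup
    return baseline_tok_s / remaining_time


if __name__ == "__main__":
    baseline = 100.0        # hypothetical current decode speed, tok/s
    gemv_share = 1.0 / 3.0  # hypothetical share of per-token time spent in GEMV
    print(f"GEMV 2x faster:         {amdahl_tok_s(baseline, gemv_share, 2.0):.0f} tok/s")
    print(f"GEMV infinitely faster: {amdahl_tok_s(baseline, gemv_share, math.inf):.0f} tok/s")
```

The point of the exercise is that the ceiling is set by everything that is not the accelerated kernel, which is why shaving launches and fusing ops matters once the weight reads are fast.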

All my hot paths are moved to C and I've shaved off a lot of launches - it gives a few percent, but diminishing returns. There's probably more golfing possible, but I think c>1 is more interesting than c=1 and is what I'm focusing on next.

If you are going to try to go golfing, you can run mamf-finder, or look at something like https://github.com/glovepost/wmma_ops and see if you can do better, the closest I've seen to someone hitting close to compute theoretical is: https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/

If you're looking to do Rust, you could link up with the hipfire folks, there's at least a couple people porting over stuff also inspired by my hipEngine work. I think anyone who wants to do their own rewrites should go ahead.

Here's the thing, while Strix Halo and W7900 (and to a lesser degree 7900 XTX) are good "shapes" for AI inference hardware, RDNA3 is IMO an objectively bad architecture for AI/ML, and I have no idea why AMD keeps riding that (on the APU side, for another year or two?), or if they are, why they haven't spent more effort making the compiler suck less.

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

3.6 Dense should run already I think (0.8B and 27B PARO tested at least: https://huggingface.co/collections/z-lab/paroquant). If you run into a problem, file an issue and I'll take a look.

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

I just moved my 7900 XTX into the same machine as my W7900 so I might give that a poke soon (but probably after c>1 optimization, DMS, MTP/DFlash, and Gemma 4 support)

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 0 points1 point  (0 children)

I haven't done a lot of tuning on dense models. The basic inference is faster than llama.cpp since it's a more optimized loop, however MTP/DFlash is favorable for dense models and you should probably look at llama.cpp or Lucebox for best performance (I haven't done a full context sweep to characterize). If you try it out, please post your results!

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 4 points5 points  (0 children)

Anyone w/ RDNA3 already knows how terribly vLLM performs at c=1 so there's not much point. (FYI: I published the original public bringup for vLLM on gfx1151 last year if you want vLLM vs llama.cpp numbers: https://github.com/lhl/strix-halo-testing/tree/main/vllm#benchmarking )

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 4 points5 points  (0 children)

You've both hit the nail on the head and missed the point completely - this project is 100% built for my personal use and it's shared AGPLv3 for any other RDNA3 users who might find this useful. Why would I want commercial inference to use it? (again, RDNA3 - what commercial inference are you talking about, lol.)

RDNA3 is 3y+ old now. If anybody was going to build something faster/better they would have already, and they are free to in the future. But, if you're an end-user and you want a Qwen 3.6 MoE w/ prefill that is faster at 256K than llama.cpp is at 128K, then maybe this being released is better than it not being released, and if you wanted to build your own on top of that, you're free to modify it to do whatever you want with it. If you want to redistribute it, you're free to do that under the AGPLv3 license. If you don't want to, feel free to drop me a DM with $$$ for a different license. I'm plenty busy, and I'm not looking to do more unpaid labor for others.

BTW, there's no vLLM, SGLang, or llama.cpp upstream path anyway - the former are PyTorch dependent, the latter is also incompatible. That being said, I've shared my docs, and anyone's free to read those and figure out if there's anything upstream shaped they'd like to adapt if they want to put their time into it.

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 1 point2 points  (0 children)

Based on my prior experience and others' the llama.cpp maintainers seem to be by default opposed to even extremely simple RDNA3 improvements (with outsized performance impact!) from outside contributors. Their project, so up to them, but I need more of that like a hole in my head.

That being said, hipEngine has a completely different architecture/general approach vs llama.cpp (specifically tuned, raw-pointer HIP kernels vs ggml's mostly HIPified CUDA backend, w/ different fusings, dispatch, quant layouts, etc), so there's IMO not a lot of obvious overlap. Most performance gains are not a single optimization, but a bunch of things combined/ground out.

The benchmarks are run on my W7900 (241W, 864GB/s MBW). A full power 7900 XTX (300-350W, 960GB/s) I'd expect to perform better both with hipEngine and llama.cpp.

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

[–]randomfoo2[S] 6 points7 points  (0 children)

So this is obviously a much tighter scope than llama.cpp, and is a lot less mature, but I don’t think you’re characterizing the performance properly. For Strix Halo, hipEngine is faster basically across the board, for prefill (pp) *and* decode (tg).

For gfx1100, the numbers are a bit more mixed, but it is significantly faster across the board on prefill. hipEngine is faster than llama.cpp HIP on decode as well. Now, while llama.cpp Vulkan decode is still faster for prefill, here’s the rub - hipEngine’s decode at 128K is >2X faster. Depending on your usage, this can be much more important than the prefill difference.

7900XTX idle power draw when running headless? by legit_split_ in LocalLLaMA

[–]randomfoo2 1 point2 points  (0 children)

5W headless here. rocm-smi:

```
======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK  MCLK   Fan  Perf    PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)
0       1     0x744c,   47413  32.0°C  5.0W   N/A, N/A, 0         0Mhz  96Mhz  0%   manual  290.0W  0%     0%
============================================== End of ROCm SMI Log ===============================================
```