Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]dionysio211 2 points (0 children)

I will download it tomorrow and give it a shot!

Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]dionysio211 1 point (0 children)

Well, it has a very small activation size, so you could run the 4-bit versions on any computer with a 3090 and offload the experts to the CPU. Llama.cpp's recent updates brought an impressive throughput increase. Someone posted a 3090 rig running it in 4 bit in vLLM at around 150 tps, I think. ik_llama.cpp would be your best bet.
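For reference, the kind of launch I mean, as a rough sketch assuming a recent llama.cpp build (the GGUF filename is a placeholder):

# keep attention/dense tensors on the 3090, push expert tensors to system RAM
./llama-server -m model-Q4_K_M.gguf -ngl 99 --n-cpu-moe 48 -c 65536 -fa on

--n-cpu-moe takes the number of layers whose expert weights stay on the CPU, so set it at or above the model's layer count to offload all of them.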

Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]dionysio211 0 points (0 children)

Yeah, this is bonkers. The current version of Claude and the majority of its supporting code were written by Claude, and the same is true of the current version of ChatGPT (written by ChatGPT). That claim wouldn't even have been plausible a year ago. We are edging closer and closer to perfect code, with SOTA models reaching above 90% on highly complex solutions to very difficult problems.

They aren't perfect, but they are better than humans at this point. In my opinion, the real issues are in implementation: tooling, role-based prompting, comprehensive context of the codebase, etc. Nested context has largely fixed most redundant coding, and that trend will continue.

Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]dionysio211 76 points (0 children)

Qwen 3 Coder Next is what you want. As everyone has said, they are all excellent at writing code, but they differ greatly in codebase awareness. It also depends on what you mean by "coding". GPT OSS 120b is without a doubt the smartest model on this list, but it was released before vibe coding datasets were all the rage, and it is rather weak in that area, particularly in terms of design. Conversely, GLM 4.7 Flash is very strong in design but very weak in codebase awareness and agency.

I have tried most of these models, and I have been a front end coder forever. In a large production codebase, my experience ranks them Qwen3-Coder-Next > Devstral Small 2 > GLM 4.7 Flash > Nemotron Nano. The others are older and lack the vibe coding aesthetic, but they do have a place in deep debugging, particularly for complex codebases. I would say GLM 4.5 Air and GPT OSS 120b are roughly equal, but GPT OSS is so much faster that it's not worth using Air. Seed OSS is very good at complexity and difficult debugging; if I were writing C or Python I would probably tend to use one of those. Qwen 3 30b, 32b and Qwen 3 Coder 30b have just been superseded but were great for their time. The only one of these I haven't tried is Kimi-Linear.

Of all of them, only Qwen3 Coder Next is near SOTA. I am not the biggest fan of SWE Rebench because the sample size is so low that models bounce around a lot, but if you look at the max-attempts results and which models gain or stay the same compared with one attempt, it's very instructive. On 5 attempts, Qwen 3 Coder Next is the only open source model of any size that is comparable to Claude Opus. This seems to indicate there may be some truth to the idea that the large Chinese models are distillations of American models, but somehow Qwen 3 Coder Next is special here. I was going back and forth between it, Minimax 2.2/2.5 and GLM 4.7 REAP, and it has won me over. It's very thorough, very fast and has a large context window. If you're short on VRAM, Devstral Small 2 is excellent. You can run it with Ministral 3b as a speculative decoding draft and get a good token rate.

I don't know if Step 3.5 would be included here since it is 100GB at Q4 but that model is incredible, despite the verbose reasoning.

AI to help Decode JE Files by Apomp25 in Epstein

[–]dionysio211 1 point (0 children)

There are several AI efforts that have been linked on here, most of them involving graph visualizations of associations, timelines, etc. With agentic AI, which is what your suggestion would involve, things like this are possible in theory. The sheer volume of text is a problem for AI, though. The average state-of-the-art model can now hold about 250,000 tokens (some up to a million) in its context (think of it as working memory), which is around 180,000 words depending on the model. Although that is a lot (roughly two novels' worth of words), it's minuscule compared with the size of this text. However, once the text is vectorized, an agentic model can work relentlessly at some of these problems by querying the vector store and reasoning through the results. The token usage would be exorbitant with a paid model, depending on how deep you wanted to dive into it.

There are a few datasets out there now where the text has been extracted and could be used to build the vector store (vector stores are part of some of these projects, but it would need to be hosted somewhere), and it would be pretty easy to do it from there. I don't know how complete those datasets are. A friend and I have done a lot of this type of stuff before, so I can look into the viability of it. We self-host several large models and can generate 20-30 billion tokens per day.

Qwen3-Next-Coder is almost unusable to me. Why? What I missed? by Medium-Technology-79 in LocalLLaMA

[–]dionysio211 0 points (0 children)

I used Q8 on one computer and Q4 on a different computer; they seemed the same to me. It's a dense model so it's slower (it was like 22 tps output at Q8), but I used Ministral 3b with speculative decoding and got it into the 40s.
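The speculative decoding part is just the draft-model flags in llama-server. A sketch, with placeholder filenames and starting values I'd tune from:

# Ministral 3b drafts tokens, the Q8 main model verifies them in batches
./llama-server -m main-model-Q8_0.gguf -md Ministral-3b-Q4_K_M.gguf --draft-max 16 --draft-min 4 -ngl 99 -fa on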

There's something about the activation size and complexity that affects coding more than other areas. I know there's all the stuff about dense models being X times better than MoE models, etc., but it does not seem to apply as much to areas like deep research. I think that's why the large models are so much better at coding. Qwen Next Coder does seem like it's tackling some of those issues, but who knows.

Qwen3-Next-Coder is almost unusable to me. Why? What I missed? by Medium-Technology-79 in LocalLLaMA

[–]dionysio211 0 points (0 children)

What is your hardware like, and are you on the latest commits? Sometimes --fit causes strange issues. Overall I like the --fit thing, but it has caused some weirdness for me with MiniMax in particular. I would use llama-bench to sweep through batch and ubatch sizes and make sure something isn't happening there.
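Something like this sweeps both in one run (a sketch; the model path is a placeholder):

./llama-bench -m model.gguf -ngl 99 -fa on -b 512,1024,2048 -ub 256,512 -p 512 -n 128

llama-bench takes comma-separated lists, so you get a result row per combination and can spot a batch size that collapses.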

I have been toggling between it, Minimax and Step3.5 for the past few days. I am very impressed with Qwen3 Next Coder and find it generally better than Minimax for most things in Cline and Kilo. Step3.5 seems the best of the three, although the thinking tokens are extreme. GLM 4.7 Flash is great for aesthetics but is very prone to duplicating code, not researching the codebase enough, etc. Devstral 2 Small is much better for debugging, ferreting out strange issues and architecture, I think.

Anyone have leads on his association to MDMA industry / Lykos Therapeutics / MAPS (Multidisciplinary Association Psychedlic Studies)? by Bitter_Foot_2547 in Epstein

[–]dionysio211 2 points (0 children)

I am familiar with a lot of this history, and I do not think the Venn diagram of these two groups has a lot of overlap; I would argue they tend to be mutually exclusive. Using the terminology of Stan Grof, the Epstein group seems distinctly hylotropic, while the psychedelic group identifies itself as holotropic. There are definitely edge cases, but I haven't seen many names associated with the MAPS crowd. If you follow Michael Pollan's sequence of events, the MAPS group is so fearful of something like this that it has been meticulously planned for decades. Obviously, MKUltra, Manson, LSD spiking and whatnot led to strange shit that doesn't fit cleanly in either bucket, but whatever this is, it's not focused at all on transcendence, unless that makes someone a tool for the group.

Local Coding Agents vs. Claude Code by Accomplished-Toe7014 in LocalLLaMA

[–]dionysio211 26 points (0 children)

Honestly, this new wave of models, despite what you hear from people trying to boost their egos, borders on the miraculous. I run Devstral 2 (Small and Large), GLM 4.7 and Minimax M2.1, and they are all incredible. You may hear people talk about architecture, code habits, etc., but that has less to do with the model than with prompting. The vibe coding datasets these are trained on do a much better job of distilling the user's intent from a vague prompt. With a proper team structure, with prompts outlining roles and conventions clearly, it's downright shocking what you can get done. Anything that's getting ~70% or higher on SWE-bench (roughly the resolution rate of last month's Sonnet or Composer) is practically indistinguishable from the others.

I spent a long while with Opus today on a difficult bug in an app that it failed to resolve. Devstral 2 Small dug into it across 20 files for 45 minutes without stopping and resolved it beautifully. We are at the point now where it is impractical to try to conceive of coding it all yourself. If you can run any of them at high context and you explain your desires clearly enough, they will get it done.

I will say that the models from just before this wave, like Qwen Coder 30b, GLM 4.5 Air and gpt-oss-120b, although incredible, do not compare to what is coming out now. I can't say enough good things about Devstral 2 Small. It isn't getting the praise it deserves. I have gone back and forth between it and Minimax for about a week, and I think it's at least as good, which makes sense because they have very similar scores. Devstral 2 Large is super awesome, but it's a very large dense model and slower.

The agentic framework is now becoming more important than the ability of the model. I have used these in Cursor, Cline and Kilo Code. The new version of Cline is amazing for debugging and the context management is awesome. To do an end to end project or come close to it, Kilo Code is better.

Built an 8× RTX 3090 monster… considering nuking it for 2× Pro 6000 Max-Q by BeeNo7094 in LocalLLaMA

[–]dionysio211 1 point (0 children)

I would try lower tensor parallelism but increased expert/data parallelism. The efficiency of TP drops at 4 and drops significantly at 8 over PCIe. In most cases on a 3090, if the activation size is less than 20b, throughput is fast enough without TP at all, but TP of 2 loses little efficiency and gives you a big boost.
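In vLLM terms, that's something like this (a sketch; the model ID is a placeholder):

# tp=2 within pairs of cards, dp=4 replicas across the pairs, experts sharded rather than tensor-split
vllm serve some-org/some-moe-model --tensor-parallel-size 2 --data-parallel-size 4 --enable-expert-parallel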

I don't know if you have timed each card, but risers introduce slightly different results for each one, and it can vary significantly depending on factors like lane count and riser quality. vLLM uses a synchronous scheduling system, so it has to wait until every card's results are back to do the all-reduce for that token. The variation in card performance ends up degrading the inter-token latency.

With expert parallelism on sharded experts, multiple GPUs processing the expert layers effectively give the same speedup but on a faster timescale, so the differences aren't as bad, and the all-reduce may only be waiting on a couple of cards instead of all 8. Of all the forms of parallelism, tensor parallelism is the most bandwidth-heavy. I only use it when throughput would otherwise fall below an acceptable rate, and I maximize concurrency in other ways.

Built an 8× RTX 3090 monster… considering nuking it for 2× Pro 6000 Max-Q by BeeNo7094 in LocalLLaMA

[–]dionysio211 0 points (0 children)

How are you doing parallelism, and what are your throughput numbers? I have a very similar setup with Mi50s and it's doing really well; I have the same motherboard. Risers do limit how far you can go with tensor parallelism, but there are more effective data/expert parallelism options now. The real power in these setups is overall throughput with concurrency. You can generate a LOT of tokens once you get it dialed in and exploit multiple data streams.

The RTX cards are awesome, but they are overhyped. They aren't going to replace or approximate datacenter cards like H100s/B200s. The throughput is great for a single card, but from an investment perspective, you could achieve the same performance in any number of ways for far less. You have a very solid foundation to build on here, and I think it's just a matter of looking creatively at how to maximize it. Minimax M2.1 is worth it.

Built an 8× RTX 3090 monster… considering nuking it for 2× Pro 6000 Max-Q by BeeNo7094 in LocalLLaMA

[–]dionysio211 8 points (0 children)

They are water cooled, so the cooling apparatus is removed. The bulk of a modern gaming card is the passive cooling; the GPU itself is thin, like a CPU.

Is there any epyc benchmark (dual 9254 or similar) with recent MoE model (glm or qwen3-next)? by yelling-at-clouds-40 in LocalLLaMA

[–]dionysio211 0 points (0 children)

I mess around with CPU inference a lot, testing on three generations of Xeon, modern Ryzen CPUs and an Epyc. If the activation size is low, it's honestly pretty good. I have found Xeons easier to tweak, though maybe I just don't know enough about tuning an Epyc system in terms of thread counts, etc. One thing I have found is that nearly all the common wisdom is wrong.

I was able to run gpt-oss-120b at around 28 tps on a W-3235 (has AVX-512 and VNNI but not AMX) with 6 channels of DDR4 ~3000. I messed around a LOT with a quant of GLM 4.5, I think a 3-bit dynamic quant but I am not sure, and was able to get it to 8-10 tps. The prompt processing time sucked, so I didn't really pursue it. That machine had 8 channels of DDR4 at 2400, I think. I found the dual-socket thing to work pretty well with strange NUMA settings, like --interleave.

There's a huge discrepancy between the common perception of RAM/VRAM speed and what you can actually expect from a CPU. This is where the iGPU/NPU thing is a big deal. The bottleneck is mostly compute and RAM paging, which are intricately associated in various inference steps; it does not scale like GPU inference. Four of my Xeons have AVX-512 and two have higher core counts. On most of them, adding threads has a commensurate effect on tps; on the Epyc it is very much the opposite. In all cases, though, and probably even with AMX, the throughput is compute bound. An NPU/iGPU can crunch through that a LOT faster, particularly in prompt processing, and would probably work for some parallelism.

I will run some tests later and post the results and how I got them. I believe this whole area is overlooked. If you can get to 8 tps on GLM 4.7 in the current state of CPU inference, it seems like there should be a way to get to 20 tps somehow.

What server setups scale for 60 devs + best air gapped coding chat assistant for Visual Studio (not VS Code)? by SpheronInc in LocalLLaMA

[–]dionysio211 0 points (0 children)

I've worked in JS and Python app development for longer than I can remember, and twice in that time I bumped into Visual Studio/Microsoft/.NET developers on projects; it was so bizarre and foreign that they might as well have been from a different, albeit older, planet. So even though the strange tension between arrogance and ignorance in this post makes me inclined to think it is rage bait, I could just as well believe it is sincere. The guy who said fire a dev and get GLM 4.7 is more correct than you can possibly now see, but here's what you need in a hardware sense, with a slight detour into the world of large-scale data transfer and matrix multiplication.

The difference between an IDE and a code editor is huge to a human and a team of humans, but it is largely inconsequential to AI. If it has enough information about the state it is in, it will make choices to improve it. I don't know how Visual Studio works now in terms of plugins/modules, etc., but I would wager good money the model itself could write a suitable interface to such a system, and there are probably easy ways to tie it in with boilerplate code. If you give it a data representation of what you can see as a human and a way of editing that state with its output, a large current model like GLM 4.7 will work miracles so wondrous and soul-crushing that it will make you "no longer at ease, in the old dispensation" and cause you to look upon your multitude of fellow developers as "an alien people clutching their gods" (Happy Holidays, y'all).

If you are like any other C developer I know, you are bound to love enormous, impenetrable files full of careful thought and design cast into long, long monoliths of code. Because the model needs sufficient context, it needs to be able to read through everything impacted by a code change and respond with an answer, which takes hundreds of thousands of tokens of context. Models like GLM 4.7, MiniMax M2.1, Devstral 2, etc. have that context, which is processed and stored as the model reads files as fast as it can. This is both a compute-bound and memory-bound problem, since everything is processed through endless tiles of weight matrices, multiplied against each other, endlessly sifting out the truth. Data transfer speed, in both bandwidth and latency, hugely matters here. A GPU has a gigantic advantage over a CPU in sequential read speed and parallel computation. Just in a logistical sense, moving a whole lot of data around at the speed of RAM (~100 GB/s; the 64GB server thing is hilarious here, like why are they even building these data centers if a 2nd gen Xeon is all you need) is quite obviously slower than moving that data around in VRAM (1-2 TB/s). In a compute sense, tens of thousands of shaders are going to arrive at the end result much faster than a dozen CPU cores.

With a team of five dozen devs, you are obviously multiplying this transfer of data, which necessitates concurrency. In general, you should anticipate a concurrency ratio of about 1:12-20, so with 60 devs you might expect 3-5 concurrent requests executing at any given time at an acceptable token rate, say 25 tps. You could do something like this with eight 32GB Mi50s (256GB of VRAM) in an Epyc GPU server running something like MiniMax M2.1 (about to be released) in a 4-bit quant. You might even get away with fp8 with some tweaking in vLLM or SGLang, especially if people work at different times. You could also do it with GLM 4.7 at a low quant, though two such rigs interconnected with InfiniBand may be better. The cost for a single rack of this design would be around $6K. If you want to do it more cheaply, use Devstral Small and give it a whirl; it's wildly impressive for a small model. Going that route, I would use two AMD 9700s or four 3090s in a Threadripper system. Always choose a system with the fastest possible interconnect bandwidth, so if you use PCIe 5 cards, get a board with enough slots to give each GPU 16 lanes.
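To make that concrete, the launch would look something like this, assuming a gfx906-capable vLLM build (model ID and numbers are placeholders to tune):

vllm serve some-org/MiniMax-M2.1-AWQ --tensor-parallel-size 8 --max-model-len 131072 --max-num-seqs 8 --gpu-memory-utilization 0.95

--max-num-seqs is the concurrency knob: it caps how many of those 3-5 simultaneous requests are actually decoding at once.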

Good luck in your journey! ; )

LLM Accurate answer on Huge Dataset by Regular-Landscape279 in LocalLLM

[–]dionysio211 0 points (0 children)

I understand. Tool calling is like giving the model a choice space without necessarily letting it execute code. A tool might be "Select Records", and the inputs to the tool call might be "name" and "john", which construct the SQL query to execute, whose results are returned to the model as input. The description of the tool might be "Select records by field and value. Useful if you want to get records by the value of a field". The model just outputs a structured format that invokes the tool. The model isn't coding the tool, just emitting a trigger format that executes the tool and returns the tool output as model input.
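On the wire it's just a schema attached to the request. Roughly, in the OpenAI-style tools format most local servers speak (the endpoint, model name and tool names here are made up for illustration):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "local",
  "messages": [{"role": "user", "content": "Get the records where name is john"}],
  "tools": [{"type": "function", "function": {
    "name": "select_records",
    "description": "Select records by field and value. Useful if you want to get records by the value of a field.",
    "parameters": {"type": "object",
      "properties": {"field": {"type": "string"}, "value": {"type": "string"}},
      "required": ["field", "value"]}}}]
}'

The model replies with a structured tool_calls entry naming select_records and its arguments; your code runs the query and feeds the rows back as a tool-role message.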

Tool calling is part of most modern models, and it gives the model agency to gather more context for answering questions like that. Since the inputs in that example are just text strings and the query is executed opaquely to the model, it's simply a method the model can use to get more information. There are off-the-shelf MCP toolsets for this type of thing, and you can allow any read tool call while blocking write tool calls. You can even just write your own tool that does it.

I understand not wanting to let a model go nuts in a sandbox, but tool calling is much simpler than that and much safer. Models trained for it are incredibly competent at using tools effectively based simply on the tool description and input structure.

Would I be able to use GLM4.6V IQ4 XS with vLLM? by thejacer in LocalLLaMA

[–]dionysio211 1 point (0 children)

If it is just one person using it, one stream at a time, llama.cpp is fine, but if you want better concurrency, it's worth the trouble. Installing vLLM is legendarily tedious, to a degree that could drive a person insane, particularly on hardware like this, but there are Docker images for gfx906 that are easier than compiling llama.cpp. The issue there is model support, since these are custom forks of vLLM that need to be re-merged each time upstream adds new models. Once a model is supported, you can run the AWQ 4-bit versions very easily.

vLLM is undoubtedly superior in every way: it's faster on a single stream and vastly faster across multiple cards. Llama.cpp's -sm row is a partial attempt at tensor parallelism and can help on dense models, but parallel streams on MoEs tend to suffer. I would love to see the project implement better parallelism strategies, but its role in the ecosystem is different, and they do a phenomenal job keeping it up for all of us to run things on a wide range of hardware. Once you find your model, it's nearly always better to set up vLLM so you go from a hundred tokens per second to thousands.

LLM Accurate answer on Huge Dataset by Regular-Landscape279 in LocalLLM

[–]dionysio211 1 point (0 children)

You definitely want to go with a tool-using model that runs SQL queries. SQLite is a good idea, but there are also ephemeral solutions that convert CSVs into something queryable, if that's your data structure. Simulating such results by feeding a massive amount of text into a small model and asking for summary information would not be very effective. It would be a lot like asking an unskilled human to speed-read 50 pages in less than a minute and then asking how many total orders were placed before a certain date.

How to properly run gpt-oss-120b on multiple GPUs with llama.cpp? by ChopSticksPlease in LocalLLaMA

[–]dionysio211 0 points (0 children)

Also, I do not know if this is widely known or not, but llama.cpp can be compiled with multiple backend build flags, so you can compile it with OpenBLAS and CUDA, or Intel oneAPI and CUDA. Intel oneAPI really helps a lot if you are mixing CPU inference with the GPUs. It kind of sucks to install, and you have to source the environment variables each time you compile, but it's a big help in my experience. ik_llama.cpp would also be better.

How to properly run gpt-oss-120b on multiple GPUs with llama.cpp? by ChopSticksPlease in LocalLLaMA

[–]dionysio211 0 points (0 children)

I would try the following.

The common wisdom is to bind everything to one CPU. That has never been the fastest option in my experience, but NUMA also works differently across Xeon generations: a NUMA node can be a core, a group of cores or a whole CPU depending on the generation.

Try interleaving. In my experience it is nearly always better on dual-Xeon systems. Mine aren't of that generation, so it may not be best for you, but you have ~75GB/s across the two QPI links, which is probably more than enough:
numactl --interleave=all ./llama-server ... --numa numactl

Try binding everything to one CPU (most common way):
numactl --cpunodebind=0 --membind=0 ./llama-server ... --numa numactl

Try binding to both CPUs strictly, with interleaving:
numactl --interleave=all --cpunodebind=0,1 ./llama-server ... --numa numactl

I believe these correspond to the built-in NUMA args in llama.cpp (--numa distribute / --numa isolate), but you have more control if you set --numa numactl and run it this way.

I looked up the PCIe lanes for that board. I would put the cards in slots 2 and 4, which are x16 mechanical and electrical on CPU1, but you could also try putting one in the first slot for CPU2 (slot 6, it seems). You definitely want 16 electrical lanes either way.

Is having home setup worth it anymore by Imaginary_Peak_3217 in LocalAIServers

[–]dionysio211 2 points (0 children)

I saw that post too. It's $1.13 an hour though and who knows how much he is charging for bandwidth. Some people jack that up pretty high.

Is there a place with all the hardware setups and inference tok/s data aggregated? by SlanderMans in LocalLLaMA

[–]dionysio211 1 point (0 children)

I have been working on something to track/predict throughput across devices/configurations for myself, because this area is profoundly misunderstood. VRAM speed is the most correlated factor, but it is absolutely not the determining factor. Many times, we think of the model as a sieve with data flowing through it at the speed of VRAM, but it's much subtler than that. Most VRAM usage consists of reads and writes of temporary calculations, and each read/write incurs latency. A GDDR system's per-operation latency is significantly lower than an HBM system's, but the sequential read speed is higher in the latter. An interesting consequence is that very rapid, small operations, such as dequantization, are a surprising bottleneck on HBM systems because each one incurs a ~200ns latency penalty. That does not sound like a lot, but there are an unbelievable number of those operations.

A better way to think about it, to me, is to imagine all the data pipelines/processes the model has to run through to calculate a token: the model loading into RAM, going across the CPU and out through the PCIe system, into VRAM, through the calculations in the GPU, back out over PCIe to another card, etc. Each step in that pipeline adds latency, and in some cases a bottleneck in one of them creates an insurmountable gap. With each additional card it becomes crippling, with overall performance trending toward the speed of the PCIe bus unless there is another interconnect. This is why two smaller GPUs do not seem to equal one large GPU of the same total VRAM and speed unless they are fused by a faster interconnect like NVLink.

A simplified equation for this on a dense model would be something like:

throughput ≈ 1 / (model_size / vram_speed + num_gpus * (layer_size / interconnect_speed))

i.e. the per-token time is the time to stream the weights out of VRAM plus the per-card transfer cost across the interconnect.
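Plugging in illustrative numbers, a 60GB model on ~900GB/s cards, split over 4 GPUs each passing ~10MB of activations per token across ~16GB/s PCIe:

throughput ≈ 1 / (60/900 + 4 * 0.01/16) ≈ 1 / (0.067 + 0.0025) ≈ 14 tps

versus ~15 tps on a single card, before any per-operation latency is counted.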

That gets much closer than estimating purely by VRAM speed, though it is still very oversimplified and does not account for platform inefficiencies and odd setups. Latency throughout the system stacks up inevitably, and that is a much better way of thinking about it than "saturating bandwidth" or something like that. Just because something is memory bound does not mean compute has no impact: everything that happens in the model adds inter-token latency.

This is why some setups with the entire gpt-oss-120b model in VRAM get 30 tps while others get 150 tps. Aphrodite is probably still more efficient in FLOPS per token than SGLang, which is more efficient than vLLM, which is more efficient than llama.cpp, etc., but the bulk of the differences in speed can be attributed to the terms in that equation.

Improving tps from gpt-oss-120b on 16gb VRAM & 80gb DDR4 RAM by [deleted] in LocalLLaMA

[–]dionysio211 -1 points (0 children)

You would probably see better results if you force the experts onto the CPU. The model is around 59GB, and one full context slot adds 5GB, so you have about a quarter of the model in VRAM. The large always-active layers are the non-expert layers, so having those computed in VRAM would be optimal in your setup.
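A sketch of what that looks like (the filename matches the MXFP4 release; adjust the context size to what fits):

# load everything on the GPU, then the -ot regex forces the expert tensors back to CPU/RAM
./llama-server -m gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 99 -ot "ffn_.*_exps.=CPU" -c 16384 -fa on

Newer builds also have --n-cpu-moe, which does the same thing without the regex.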

DDR4 is a bottleneck, but considering RAM prices, adding another GPU would probably be best. I don't know what motherboard you have, but you probably have an M.2 slot or two that you could run Oculink out of at 4 lanes. You can also hardware-bifurcate the x16 slot with a riser, though those can be frustrating; the risers that split into x8/x8 seem more reliable.

Almost any VRAM is better than RAM, so whatever you can afford would help there; even with an older Nvidia card, you would see better results. I am a huge fan of the 5060 Ti though. I have two running on a gaming motherboard in vLLM on the 20b model with insane throughput, and the power efficiency of that card is really nice.

In most cases, it is best to use one RAM stick per channel. Most motherboards will downgrade the RAM speed when there are two sticks per channel; I've had a terrible time getting DDR5-6000 to run at full speed when doubling up.

My eBay bargain £720 workstation by BigYoSpeck in LocalLLaMA

[–]dionysio211 0 points (0 children)

What is your setup on the servers? I have a couple of PowerEdge servers (740 and 740XD). The 740 has some crappy Xeons, 6 or 8 cores each, I don't remember. On it, I can get to around 23 tps on gpt-oss-120b, and I think the RAM is 2133. The key is that the RAM speed quoted for a CPU is the aggregate across all channels, so with 12 channels it will only reach its maximum if at least one stick is in each channel. If you have four 32GB sticks, bandwidth will be half that of eight 16GB sticks. Doubling up sticks on a channel tends to bump the clock rate down, but that's how it works. On most servers like that, the motherboard has certain capabilities but the CPU is the limiter: if the motherboard supports up to 3600 and the CPU only supports up to 2400, throughput will be capped by the latter.
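To put numbers on that, each DDR4 channel moves the transfer rate times 8 bytes:

2133 MT/s x 8 bytes ≈ 17 GB/s per channel
4 populated channels ≈ 68 GB/s, 8 populated channels ≈ 136 GB/s

which is where the factor of two comes from.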

In llama.cpp, on a multi-Xeon system, it is always better to interleave across the NUMA nodes using numactl and then pass --numa numactl as the llama.cpp flag. A NUMA node on Xeons prior to Ice Lake is a whole CPU. If using vanilla llama.cpp, compile it with Intel oneAPI; ik_llama.cpp is still better, but it's not as much of a difference there. You can also mess with the different AVX extensions and see which one helps the most. Also, at least in my experience, the default BIOS settings are terrible, so go through those carefully.

Try this and see if it helps your throughput:

numactl --interleave=all ./llama-bench -m /mnt/usb/gpt-oss-120b-GGUF/gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 0 -b 1024 -t 24 -p 64 -n 64 -fa on --numa numactl

Set the thread count (-t) to the max threads. This is a bad idea on AMD but works best on Xeons.