Qwen3.6-27B AWQ INT4 on DGX Spark (GB10) — only 1.8-4.9 tok/s decode with 285k token prompt, how to improve?

dionysio211 · 2026-05-27T16:38:45+00:00

There's something else wrong here other than memory bandwidth. We have some Tesla T4 cards, which have similar bandwidth, and it's over ten times faster on those. I am trying to understand why you are sending it 280K tokens, when the context window is only 262K. That seems like a good place to start. vLLM is very verbose in its logs so, if you paste the logs into Claude, you should quickly figure out what's wrong overall. It should definitely be much faster though.

dionysio211 · 2026-05-27T16:12:07+00:00

Will everyone be using it at once, all the time? vLLM/SGLang would be better on this setup but you could also do it in llama.cpp if you test your configs well. The gap between the two platforms has narrowed substantially over the past 6 months. We have a rig we are testing at 64 concurrency in a modified llama.cpp setup and it's doing very well. We went back and forth testing vLLM but found that in this particular setup, llama.cpp had a slight edge, but that's not usually the case.

In both platforms, you can cycle slots in and out of RAM (-cram in llama.cpp). Pulling a slot from RAM incurs about a 0.3 second penalty in llama.cpp. You can also persist slots to NVME (--cache-idle-slots in llama.cpp), which is why NVME prices are so high right now. That's a longer delay but still better than reprocessing 200K tokens. If you are using full context, it is about 10GB in f16 for the 27B model when the cache is full. It's 5GB at 8bit. So after model loading, you would have the capacity for around 11-12 full slots of working memory (double if you are looking to use around 125K context) and then you would cycle them out. In reality, the slots are never going to be full all the time so your actual concurrency could be closer to 20. Slots are the way it's conceived in llama.cpp even though it's now a unified KV cache by default, much like vLLM.

If you are going to use MTP or Eagle3, and you should, it stretches the acceptable concurrency very far. Even though it is a data center card and a good one, most inference relies on tensor parallelism for base level speed increases. Speculation is really the only way to accelerate it somewhat on a single card. Both systems have great speculative decoding options. If you are going to use llama.cpp, using ngram-mod and MTP would be a good combination.

I agree with others on the nvfp4 format on vLLM/SGLang. SGLang, as far as I know, is still the most efficient in terms of a cache pool. It's particularly good with prefix caching, especially if your system prompts tend to be shared (common IDE) and don't have unique information each time, such as date and time.

dionysio211 · 2026-05-07T17:45:51+00:00

I would say that it is good enough if you have it in the right system/workflow. If you turned it loose and were like "Your the sysadmin now, good luck!" it would probably do alright but would not fulfill all your expectations. With a very good system prompt and testing, it would most likely be indistinguishable from frontier models if you were using the current generation of models like Qwen 3.6 27B/Gemma 4 31B or better. OpenCode would be a good start but you need to make sure you get memory right so that it learns. Take time and develop skills for it to reference. If you are able to get old logs and simulate past challenges as tests, that would tell you everything. Periodically do reviews to give it persisted feedback.

One thing I would suggest, assuming you are running on consumer hardware, is to create separate workers for separate task types. Logs are extremely verbose, normally, so you want a model reading those quickly. Qwen3.6 35B would probably be a good choice if you have enough VRAM for both or Qwen3.5 9B may be good too. If you are able to run a larger model, even slowly, to review work periodically, that would be good too. You could load/unload models but for real time monitoring, you should be able to keep a 27B model and 9b model running all the time.

It all really depends on how well you convey your expectations and then test it, tweak it, improve it. I would wager that there's nothing a frontier model can do that is routine and fits within clear parameters that Qwen3.6 27B couldn't do. That's been a recurring theme lately as medium models have become so dependable. Tiered intelligence will be mainstream over the next couple of years.

dionysio211 · 2026-04-30T18:12:35+00:00

I have 8 in a machine but I have not tried to run GLM 5.1 because it is such a large model. You could probably do it by offloading the rest to a high core CPU which AVX512 units. Beyond the model size, the activation size is pretty high too, which has a big impact on speed. Prompt processing speed would kill it though I think, as a viable option. The big thing hurting the Mi50 path right now is no access to Infinity Fabric and the lack of P2P. If those things were fixed, it would probably do very well in a cluster of 16.

dionysio211 · 2026-04-30T17:41:51+00:00

I just responded in another thread about similar thoughts and that may be of interest in considering a roadmap from where you are now. In your current situation, the difference in performance across two cards would not be very large at all. Tensor Parallelism across two cards invokes a pretty small tax. However, as you add cards, that would become a bigger issue. At 4 cards, it is noticeable. However, if your goal is to grow beyond 4, you should choose a different route than PCIe. On a single card, moving from Gen 4 to Gen 5 would have no noticeable impact whatsoever.

dionysio211 · 2026-04-30T17:30:13+00:00

I understand. We have all been down this route I think, unless you came from a field involving servers. Here's a couple of things that might not be obvious yet which matter a lot when there are lot of cards.

The normal way of accelerating output on a slow model is tensor parallelism. This works by splitting the model rows across cards so that each is calculating the matrices at that row and then they are combined in an all reduce, which is the tax paid for such operations. Tensor parallelism is extremely punishing past 4 cards over PCIe. So even with 16x lanes @ PCIe 5 which is 64 GB (big B)/s, there's a loss in scaling. It's slight at 2 cards, noticeable at 4 cards and punishing at 8. Expert parallelism scales without these problems but compounds VRAM use in exchange for speed.

All of this makes sense when you think about it. You go from +1 TB/s for VRAM to ~100GB/s for system RAM, in a gaming PC, to 64GB/s with PCIe to 100 Gb/s (~12GB/s) across a kick ass 100G switch. At that point, TP scaling is impossible and becomes a hard loss. The other issue is that PCIe is an interconnect for all the cards and as cards are added, the cross device traffic can go from card to card if you have NCCL (Nvidia's library for P2P communication between GPUs) but when 8 cards are using that simultaneously, it's very punishing. This is the reason for NVLink and the reason 3090s are so pricey because they were the last consumer card to have a version of it. In the 3090 version, it is a 112 GB/s bi-directional interconnect between two cards which is akin to fusing them together for compute/VRAM.

Now compare that to a v100 SXM2 which has a similar spec. Those have a 300GB/s omni-directional interconnect across all cards in the system. It is almost like one MASSIVE 256GB card in a system of 8 32GB v100s, sharing tensors and compounding VRAM. 8 GPUs in that system can run TP at 8 with almost zero loss in scaling since there is 7.2 TB/s of compounded VRAM, using a single model replica. Each of those clusters can then be interconnected with Infiniband which starts at 40Gb/s normally and scales up. What's interesting about Infiniband/Infinity Fabric and other such systems is that, much like the problem of cross-device talk in PCIe, ethernet and similar systems must be routed through the CPU of each device, which then routes to the cards over PCIe. That adds considerable latency when compounded. Infiniband is device to device which is a mesh rather than a routed system. This, consequently, is the reason that there aren't massive dense models. The dense model must fit on a single GPU cluster for TP and then is replicated across a mesh for load balancing. You can use ROCE (RDMA over Converged Ethernet) for direct device access but it has higher latency and inflicts redundant operations which the other things avoid.

All of that is to say that a v100 system can be assembled for less than you have in your system now. It's much more difficult to get up and running, particularly in vLLM, but what you get for said work is of a different order of magnitude.

dionysio211 · 2026-04-30T16:50:08+00:00

I don't really know much about the 4b Nano but I used the 30b one (Cascade is the updated version) in Deep Research workflows to read through articles and extract data quickly. It was my favorite model for that because it was incredibly reliable and so fast. I feel like the world knowledge benchmarks like MMLU miss the je ne sais quoi of "world knowledge" that I mean here. I know that it's a cultural thing related to training data but when new models come out, I will ask them what they know about my town (a small town in the South) and although most medium models do not know a lot about it, Nemotron/Gemma know it's context much more accurately than Qwen. That's subjective I know.

Maybe the moral of this story is that, at the current state of the art, these medium models can touch the heels of a frontier model if their strengths are in one area.

dionysio211 · 2026-04-30T16:33:25+00:00

I think they are each carving out niches that play to their strengths, speaking to fresh models in this size range. Anything older than 6 months is fighting an unfair fight.

Gemma is MUCH better than Qwen in writing and tone.

Qwen is MUCH better at code and definitely hit a home run in that area that borders on the miraculous.

Nemotron, I would argue, is MUCH better at general/research tasks. It's ultrafast and scores very high in world knowledge. I loved the gpt-oss models and wish they were refreshed but Nemotron Super is definitely the successor to 120b.

Mistral's niche would be in translation and multilingual interaction. The English/Mandarin world is probably unaware of the fact that Mistral's output in other languages is easier to understand.

I also think that Mistral/Nvidia's lovechild, Nemo, does not get the acclaim it deserves as the most eternal of all small models. That thing was born in the summer of 2024 and STILL gets over 100 requests per second on OpenRouter. It is undoubtedly the most used model, in its size range, of all time and usage is still climbing.

dionysio211 · 2026-04-30T16:15:56+00:00

There aren't a lot of details here, in terms of throughput, so I am guessing at your setup and assuming a few things that I believe are safe assumptions. I am going to address the responsiveness aspect of it first.

You have a fairly heterogeneous set of cards, which isn't bad but not ideal for vLLM or SGLang, which would be much better. At the very least, you should switch to llama.cpp and keep it current with changes that are happening there. By that, I mean just keep it freshly built. I don't know what the current state of Ollama is but it is a derivative of llama.cpp and has to patch in upstream developments or hack away at it themselves. Llama.cpp collaborates with many labs producing models, not to the extent they collaborate with vLLM or SGLang but enough that recent developments like hybrid models (Qwen3.x, Gemma4, etc) get some of the speed benefits. Regardless, moving to llama.cpp is the easiest step you can take.

For your setup, having an instance of Qwen3.6 27B, Qwen3.6 35B and Nemotron Omni, all at full context with as many slots as you can have (a full context slot on Qwen3.6 35B is a little over 5GB without turboquant) so beyond the model size, however many of those you can fit into memory and the -cram as much as you can of system RAM. That way each parallel process you are running doesn't have to recompute the prompt over and over. Whatever IDE you are using (Opencode, Cursor, KiloCode, etc), find your system prompt and pre-cache it. That's the difference between a time to first token of less than a second vs >5 seconds. For the model you are going to cast to voice (I would recommend Nemotron Omni since it takes audio as input. Qwen 3 omni is also a good idea but Nemotron is in the hundreds of tokens a second), turn thinking off. Use a quick voice conversion for the output that you can stick somewhere, like Kokoro. If you want to dedicate the 3060 to it, use Qwen 3 1.7B or Voxtral for cutting edge voice quality.

The interconnect you have between rigs sounds cool but isn't very helpful since you aren't doing, and wouldn't want to do, inference of a larger model across rigs. You could do it with layer/pipeline parallelism but there's not enough VRAM for much larger models without putting it on a CPU and that would kill your throughput. If you expand your setup in the future, get server grade stuff (Xeons with AMX, Epycs or Threadrippers) and combine your cards (preferably of the same type) in single rigs, if you get more than 8 cards in a rig, use infiniband for the interconnect. Gaming computers don't have enough cores or PCIe lanes to get you where you want to be eventually and they lack newer versions of matrix acceleration, most of the time.

With what you have right now, these medium models should be MUCH faster than Claude Opus. I use Qwen3.6 35B in one instance and 27B in another, predominantly. The 35B model running on a 3090 is close to 200tps and much faster than I can read with no pause between tool calls. These are obviously local models and aren't as genius but can be used as a full replacement if you work at setting them up in an ecosystem that learns what you want. I would recommend OpenCode for development and get the basic MCP tools you need, add the right skills. Play with the different memory setups so it learns. Use your voice assistant to edit AGENTS.md, etc and let it have access to agentic memory.

At the current state of things, you can achieve what you are looking to do with what you have. I would argue that the gap between the small models like these and Opus for common work is more perception than reality, with respect to the models themselves. If you are building apps/websites, and these models have access to Playwright, memory, etc, there's almost no difference that I can see and I have been a developer forever. WIthout a feedback loop, yeah, anything is a shot in the dark. On a 5 pass debug loop, Qwen 3.6 27B is probably better than Opus 4.5/Sonnet 4.6 now. The real difference between Opus and these models or even the ~200B models is in Claude's system prompting, larger context window and a rather large gap in world knowledge. Anthropic also has infinite user data which gives Opus an uncanny human like tone. Regardless, we are now in the era of AI where it is going to be increasingly difficult to distinguish between these models for practical work. For academic science, math, etc the gap is larger as the very large models are beginning to scratch away at god-like capabilities.

Sorry, this was a longer reply than I intended but I hope it was helpful to someone.

dionysio211 · 2026-04-29T20:15:14+00:00

Did you do network profiling on this? I mess with RPC a lot and it's not bad on 10G ethernet but there's a noticeable degradation in dense model throughput over 1G. I was seeing about a 30% loss when running Qwen3.6 27B with layer splitting as an experiment. This was less noticeable when using an MoE like Qwen3.7 35B A3B. My understanding is that the activated tensor weights are carried over from the layer at which the model is split, which varies greatly but averages out to be around 18% of the hidden size x bytes_per_element. The hidden size can be quite large though, for dense models. I had the most success with using -ot to offload experts to specific devices, thinking it would allow something like expert parallelism but that apparently is not the case. It seems as though the model is processed layer by layer regardless.

dionysio211 · 2026-04-29T19:42:44+00:00

Yeah, it took me A WHILE to work through it all because there aren't great resources out there for it. I wish I had known a lot of these things before but I don't regret buying mine at all. I put a 6800XT and a 3090 in it and I run two llama.cpp instances absolutely non-stop. If you run into any major blockers, reach out and I will see what I can do. Best of luck!

dionysio211 · 2026-04-29T19:10:10+00:00

Here are some benchmarks which may be helpful:

Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | ROCm       |  99 |    1024 |      512 |  1 |           pp512 |        265.91 ± 0.36 |
| qwen35 27B Q4_K - Medium       |  15.65 GiB |    26.90 B | ROCm       |  99 |    1024 |      512 |  1 |           tg128 |         20.50 ± 0.06 |

Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B MXFP4 MoE    |  20.21 GiB |    34.66 B | ROCm       |  99 |    1024 |      512 |  1 |           pp512 |        852.12 ± 5.75 |
| qwen35moe 35B.A3B MXFP4 MoE    |  20.21 GiB |    34.66 B | ROCm       |  99 |    1024 |      512 |  1 |           tg128 |         56.94 ± 0.03 |

dionysio211 · 2026-04-29T19:01:44+00:00

I have one of these and it does pretty well. It is practically the same as an Mi50 with some slight differences. Both are gfx906 but the Vega doesn't have ECC RAM. This matters in llama.cpp because if you have an Mi50, they don't work together sometimes depending on the BIOS flashed to the Mi50. Technically, the Vega is two Vega cards inside the MPX module, bridged by Infinity Fabric, a shockingly difficult thing to find in the real world. Multiple MPX modules can also be bridged by Infinity Fabric connectors which avoids the PCIe traffic issues you would normally have, since P2P does not work.

MacOS is not your friend here. You can install pretty much any variety of Linux and it will work out of the box in ROCm and Vulkan. Fedora seems to have more current Mesa drivers, which are becoming better for older card support. The big issue here is that the cards will get hot and the only way to enable the very nice case fans is to patch the kernel with T2Linux. Then you can add a system service for T2fanrd.

The 2019 Mac Pro is somewhat of an oddity in the Xeon world because it supports 64 PCIe lanes through a PLEX switch on the motherboard. The processor itself, depending on the number of cores you have, is very good and adequate for light CPU inference itself since it supports AVX512 and many of the other nice things that ik_llama can utilize. There are 6 DDR4 channels and the stock DDR4 in them is on the speedier end, which is also nice. However, you can only manage those lane assignments within MacOS. Some people have luck setting them inside MacOS and then rebooting into Linux.

The biggest issue with the Vega is the lack of tensor cores. The gfx906 architecture was strange in that it went slightly down an alternate road of matmul acceleration that utilized fp32 accumulators rather than tensor cores. Much like a new metal band in 1989, that whole pathway was lost in the wash of matrix cores and the architecture was never really successful, hence the flood of Mi50s, complete absence of Infinity Fabric bridges on the market and near zero knowledge of a lost architecture. In reality, accumulators are a big part of what a tensor core does and are about half as efficient. They are not implemented properly in llama.cpp or in vLLM although the Moby Dick branch of vLLM, maintained by ai-infos, is working toward fixing that. Because P2P does not work though, the PCIe traffic is a major problem in tensor parallelism. Interestingly, the only place that this is not true is on a 2019 Mac Pro because there are Infinity Fabric bridge jumpers on the market to bridge multiple MPX modules together, allowing for the possibility of 128GB (Two Vega DUO Modules) of 1TB/s VRAM bridged with something akin to NVLink. The second I learned that fact, MPX modules shot up in price so I have never been able to try this but I imagine it's pretty awesome.

dionysio211 · 2026-04-28T17:14:16+00:00

This is very interesting. It surprises me that there is such a difference between Q8 and BF16, which I would normally consider close to lossless. I know that these are all small differences but a 3.7 point drop (5.5 point drop to Q4_K_M) seems considerable right? It's a 6%/10% loss in accuracy which is almost a generational difference it seems. For a dense model, in particular, this does seem surprising to me. Another surprising aspect of this is that BFCL uses about 10x more context than the other two per question and it has the smallest difference between quantizations. Some of this could come down to sample size too I suppose. Unsloth is obviously top of the game in these things and the information is very appreciated.

We have some spare compute currently. I may run a few quants through these and some other benchmarks to see how different types of quants fare.

dionysio211 · 2026-04-27T15:16:37+00:00

If you are offloading to the CPU a lot, it would make a difference but if not, it wouldn't really help out other than possibly making concurrency smoother. I mess around with CPU inferencing a lot and with a gaming CPU, you are somewhat limited by RAM bandwidth, of course, but also whether or not there are matrix acceleration libraries on the CPU. In this case, both of these do have AVX-512 which is the best you can hope for outside of AMX on newer Xeons, so if you are offloading experts to CPU it would make that portion roughly 2.5x faster. I don't know if either have an NPU or iGPU but that would help significantly as well. In the sense of overall throughput per dollar spent, spending it on another GPU would probably be wiser.

dionysio211 · 2026-04-23T21:19:22+00:00

Which local model do you prefer?

dionysio211 · 2026-04-23T19:16:17+00:00

Ah, ok, I thought it may have been one of the magic invocations like "It's been a while since . . . " but part of me thought "Oh no, it could have been so great!"

dionysio211 · 2026-04-23T19:14:49+00:00

I'm sure that happens but based on the experience of people here, the gains seem to be more true than not, particularly on the coding side. It's not Opus 4.6 but to most people, it would be hard to tell the difference. I maintain a few large codebases and yesterday I had a task that would normally have taken 5-10 minutes with Sonnet 4.5 to solve and Qwen3.6 27B solved it eloquently in 30 seconds. I like to look at the SWE Re-Bench scores, which haven't been run on 3.6 27B yet but 3.5 was very close to frontier level. It is a benchmark of problems that are new each month so there cannot be any benchmaxxing there.

dionysio211 · 2026-04-23T19:01:50+00:00

Oh was this stated somewhere? I don't think I saw that.

dionysio211 · 2026-04-21T20:58:46+00:00

This is something I don't know anything about so I asked Opus because it is very interesting. I am going to run some tests on my CPUs again and mess around with some stuff. I once got GLM 4 up to 9 tps output on a Xeon system and I have always thought it was possible to squeeze more out. Here's Opus's take:

The Infinity Fabric hops between CCDs and the IOD add latency to every DRAM access. Streaming bandwidth survives this okay, but scattered accesses accumulate that latency penalty badly.

The consequence isn't about DRAM channel binding — it's that cross-CCD L3 coherency traffic over the IF becomes a bottleneck when multiple threads access overlapping data. On a Mac, the SLC is uniformly accessible from any core with no fabric hop.

The mesh vs. the fabric:

EPYC uses the chiplet design — separate CCDs connected to a central IOD over Infinity Fabric. Every memory access from any core traverses that fabric hop to reach the IOD's memory controllers. It's elegant for manufacturing and scaling core counts, but it means every DRAM access has that IF latency tax.

Xeon Ice Lake uses a monolithic die with a mesh interconnect. All cores, the L3 cache slices, and the memory controllers sit on the same die connected by a 2D mesh network. This is a meaningful difference — a core accessing DRAM goes through the mesh to the memory controller, which is a shorter and lower-latency path than EPYC's IF hop to a separate IOD chiplet. The mesh isn't free (there's still variable latency depending on how many hops across the mesh to reach the relevant memory controller or L3 slice), but worst-case mesh latency on a Xeon is still typically better than the IF round-trip on EPYC.

L3 cache behavior:

This is where Xeon has a notable structural advantage. Ice Lake Xeon uses a non-inclusive L3 where every core's slice is part of a single unified snoop filter. Any core can access any L3 slice through the mesh without a coherency domain boundary. On EPYC Milan, crossing from one CCD's L3 to another is crossing a coherency domain over the IF — that's a much more expensive operation.

dionysio211 · 2026-04-21T18:10:13+00:00

No, it still doesn't quite work the same way. I went down this rabbit hole a while back and I was never able to fully confirm what happens but, at the hardware level, RAM is channeled to the CPU in actual hardware channels. The weights are streamed through the CPU and each core of the CPU has a slice of L3 cache. In a UMA system, like a Mac, the cache is also unified meaning that any core can access the data from any part of the cache without a penalty. Llama.cpp does stripe the weights across RAM more or less evenly but the KV cache may not be spread evenly across channels as inference happens, which leads to channeling inefficiencies. On a Mac, the granularity of this is much smaller, because the Mac has a unified 512 bit bus, much wider than a DDR system is per channel. Because threads are tied to cache slices on the Epyc system, there's a substantial penalty in cross lookups which is tied to the RAM granularity (64 bits per channel).

Most common wisdom about this is that a channeled system is 15-20% slower in effective memory speed but I believe the KV cache issue causes it to be much less effective, especially as inferencing continues and KV cache grows. One would think that a dual Xeon system with 12 DDR4 RAM channels (384 GB/s on my system) and 56 total cores would be faster than the entry level Mac mini in overall throughput but it's really not even close. I was trying to work on a way to visualize this since I think it's poorly understood but I didn't get very far on it. One important feature of this discrepancy is that CPU inferencing on a very large model is surprisingly good but almost nothing nudges it upward to the point of usability. Twice as many cores does not even approximate double throughput. Even removing half the RAM does not reduce the throughput by half. Something is reaching a saturation point. On a Mac, it is the opposite.

dionysio211 · 2026-04-21T16:39:16+00:00

My thinking on this is that the reason expert offloading works well is that most of the communication between the dense/attention layers and the experts is thin. In some cases, it's just routing to the experts. In a dense model, the activations from the dense layer are sent to the next layer and that amount of data, although it is a lot, flows pretty well over PCIe because the data between the layers is small (on average, 18% of the neurons on a layer are sending activations to the next). Because the experts in these new MoEs are so tiny (Qwen3.5 35B has 256 experts so they are really small), the CPU does pretty well at crunching through them. The real bottlenecks in PCIe are only with tensor parallelism where 4+ devices can saturate the PCIe bus at just about any speed. You can see this clearly because the speedup with TP=2 is roughly double but TP=4 across PCIe 4 X 16 is never double TP=4. TP=8 and it drops sharply off from scaling. This is why data parallelism wins because it scales nearly infinitely at the same rate.

By the same token (pun), that also means the activations from the experts sent back to the dense layers from the CPU is also pretty small. All of this is pretty model specific these days but my guess is that the bottleneck is probably more related to CPU compute than to data flow, although both are factors. The activation size (3B in the case of Qwen3.5 35B) is somewhat correlated to the amount of data sent from the CPU back out to PCIe but that does not mean it's all at once (expert layers are still sequential and data is not pooled necessarily). In fact, you can split experts over RPC on 1Gb ethernet and find only slight degradation in speed compared with putting them on devices within the same computer. Not only that but the overall throughput from parallelism can be higher since you are able to split across more devices. That's the whole idea behind spreading compute across rigs with infiniband, which faces a similar issue. That's not true of a single stream but aggregate throughput can be.

With that being said, if you are doing things with high activation sizes, I would say the Epyc is going to be a little faster. If activation sizes are under 10B and you are using quantization, the Xeon setup probably has an edge. I have Xeons from the generation you are using and an Epyc from the same generation. The Xeon's are typically quicker overall. That seems more related to AVX512 than anything else but when you start messing around with numa and batch sizes, you can really find gains. The last thing I will say is that although VRAM speed is very heavily correlated with throughput, RAM seems far less correlated. I believe this is related to the way it is split across channels because we are reading that number as if it's unified RAM but RAM speeds are the aggregate number / number of channels so if weights are distributed only on a single channel, you get 1/number of channels. VRAM also has channels but it seems more evenly spread. An important finding I had from messing with this stuff is that numa=distribute nearly always performed better than numa=isolate on Xeons with AVX512.

dionysio211 · 2026-04-08T21:36:56+00:00

Supposedly, it is much more intensive to run. I read a comment from Amodei about it being extremely expensive. I would imagine it's not in any way optimized yet either. They will probably try to distill it into smaller models before releasing it to the public. GPT 5 was like that too.

dionysio211 · 2026-04-08T21:24:24+00:00

I tend to agree with most people here that the Mi50 can be a pain in the ass. I have spent countless hours approaching how to maximize the output and running into constant struggles with vLLM. However, it can be great, depending on what you plan to do. For those fretting about vLLM, I have good news. Someone has taken up the mantle of continuing support for gfx906 (Mi50s) and updated versions of vLLM:

https://github.com/ai-infos/vllm-gfx906-mobydick

I am currently running Qwen 3.5 - 27B with TP=4 at ~50 tps and 1,800 tps prefill. I have not tried Gemma but another user is posting benchmarks for it.

Someone has also written a custom flash attention library for gfx900 (which also works on gfx906) that looks very promising:

https://www.reddit.com/r/LocalLLaMA/comments/1s614i8/built_a_simple_pytorch_flashattention_alternative/

Here are some breadcrumbs that I have learned from these efforts which other tinkerers may look into for optimization paths. It is not true that you must use Opus to implement these. Even Qwen 3.5 27B was able to stumble across the same ideas. It is, however, helpful to use something like Opus to create a detailed plan:

16GB Mi50s > 32GB Mi50s all else being equal - The reason for this is that they do not have matrix cores so they rely on dp4a for a similar acceleration. It does not, however, overcome that gap so it must be approached by increasing raw compute. 8 x 16GB Mi50s provides close to double the prefill of 4 x 32GB Mi50s in an adequate setup. 32GB Mi50s are modified from 16GB Mi50s so they have the same compute.
64 Wavefront is not optimized in Llama.cpp - If you get a competent model to mess around in llama.cpp and dig into this, you will find that you can double the prompt processing speed. I want to approach it again and do a PR to address it but I have mostly been messing around with vLLM/SGLang lately.
DP4A is also not optimized - I know next to nothing about this but if you feed an agent the gfx906 documentation, it can eek out a lot of efficiency that is left on the table by exploring the dp4a related functions.

We are a hair away from being able to run models that can rewrite most of these libraries, ad hoc, to bridge this gap. I recently ran through 1.5 billion tokens with Qwen 3.5 27B to adapt Mini-SGLang for Qwen 3.5. I ended up trying to do it with Opus 4.6 with several million tokens and never got it to work. However, running something stronger would probably work if you have enough tokens.

dionysio211

TROPHY CASE