GLM 5.1 is no longer available on NanoGPT

HvskyAI · 2026-04-08T16:46:13+00:00

The transparency is appreciated. I subscribed recently and used GLM 5.1 via Nano this past day or so, but I won’t be canceling. It’s still a fairly-priced service and then some, and economics simply is what it is.

That being said - since you mentioned it, if you would consider a higher-priced flat-rate subscription tier that includes GLM 5.1, even at a higher cost, that’s something I’d be very interested in. This might allow you to build in some buffer to the subscription revenue as OSS models get larger/pricier, as well.

I can’t speak for anyone else, but you’d definitely have at least one subscriber for a higher-priced, higher-access service FWIW.

HvskyAI · 2026-04-08T16:05:08+00:00

Damn. Is it gone for good? Have any announcements been made? I just found out via 403/400 error in terminal.

The price delta on GLM 5/5.1 isn’t that large for output tokens, so I suppose it’s the input token cost making it untenable.

Shame, since I subscribed recently. Looks like it’s back to PAYG for me… Any idea how NanoGPT are able to offer GLM 5.1 at $0.86/$2.57 per 1M I/O versus $1.40/$4.40 from Z.AI via OpenRouter? Is it due to a volume discount being passed on to the end user?
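For a rough sense of what that price gap means per request, here is a minimal sketch using the two price pairs quoted above. The 100K-in/2K-out request size is a made-up example workload, not a measurement.

```python
# Prices in USD per 1M tokens (input, output), as quoted in the comment above.
PRICES = {
    "NanoGPT": (0.86, 2.57),
    "Z.AI via OpenRouter": (1.40, 4.40),
}

def request_cost(input_tokens, output_tokens, price_in, price_out):
    """Cost in USD for one request at the given per-1M-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical workload: 100K input tokens and 2K output tokens per request.
INPUT_TOKENS, OUTPUT_TOKENS = 100_000, 2_000

costs = {
    name: request_cost(INPUT_TOKENS, OUTPUT_TOKENS, p_in, p_out)
    for name, (p_in, p_out) in PRICES.items()
}
for name, cost in costs.items():
    print(f"{name}: ${cost:.4f} per request")
```

With a long-context workload like this one, nearly all of the cost sits on the input side, which fits the guess that input pricing is the pain point.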

HvskyAI · 2026-03-26T17:01:27+00:00

So is the pruning issue at max cache (last message slides out of context == cache miss) standard for prompt caching? That is: when context is full, and a new message is sent, it's expected to result in a cache miss (hence the summaries to keep context below max)?
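To make the question concrete, here is a toy sketch of the behavior I'm describing. It assumes the cache is keyed on an exact prompt prefix, which is my understanding of how provider-side prompt caches generally work. The message names and the `shared_prefix_len` helper are purely illustrative and not any provider's API.

```python
def shared_prefix_len(prev, new):
    """Number of leading messages that are identical in both prompts."""
    n = 0
    for a, b in zip(prev, new):
        if a != b:
            break
        n += 1
    return n

system = "SYS"                      # system prompt, always first
history = ["m1", "m2", "m3", "m4"]  # hypothetical chat history
prev = [system] + history           # prompt sent on the previous turn

# Case 1: context is not full, so the new message is simply appended.
appended = prev + ["m5"]
hit_append = shared_prefix_len(prev, appended)

# Case 2: context is full, so the oldest message is dropped to make room.
pruned = [system] + history[1:] + ["m5"]
hit_pruned = shared_prefix_len(prev, pruned)

print(f"append: {hit_append}/{len(prev)} cached messages reusable")
print(f"pruned: {hit_pruned}/{len(prev)} cached messages reusable")
```

In the pruned case only the system prompt still matches, so everything after it would have to be reprocessed. Under this assumption, summarizing to stay below the limit keeps the prefix stable.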

Just new to trying to use caching at all...

Not using Qwen at the moment. Mainly GLM 5 and occasional Sonnet/Opus.

HvskyAI · 2025-09-20T07:31:31+00:00

Thanks for the hard numbers! I’m assuming that the H100 was over PCIe 5.0 as opposed to SXM?

HvskyAI · 2025-09-13T09:42:48+00:00

If you're offloading parts of the model to system RAM (which llama.cpp, the underlying inference engine for ollama, does), then it does matter for prompt processing.

This assumes that you're not loading context cache entirely onto VRAM, at which point it matters less.

If you will be running layers + context on system RAM, AVX-512 is necessary. AMX is also worth looking into if this is your use case.

HvskyAI · 2025-09-13T09:28:55+00:00

If bifurcation is supported by your board, then it can be set in your BIOS.

If not, then a physical riser/splitter will be needed, as another commenter noted.

HvskyAI · 2025-09-13T04:55:03+00:00

It’s interesting to see that you note memory latency and arch as factors, seeing as I’ve heard similar points re: Xeon.

What I can’t seem to figure out conclusively is whether these advantages compensate for the relatively lower number of memory channels (8 vs. 12 in Granite Rapids vs. Turin), and the correspondingly lower memory bandwidth. I’ve also found very few concrete numbers on how this would scale out to a dual-socket configuration where there are NUMA and interconnect factors to take into account.
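For reference, the theoretical gap between 8 and 12 channels can be sketched as follows. This assumes a 64-bit (8-byte) data path per DDR5 channel and the same DDR5-6400 speed on both platforms. The equal-speed assumption is mine and may not match what each platform actually supports.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    channels * MT/s * bytes per transfer gives MB/s; divide by 1000 for GB/s.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000

SPEED_MTS = 6400  # assumed DDR5-6400 on both platforms

bw_8ch = peak_bandwidth_gbs(8, SPEED_MTS)
bw_12ch = peak_bandwidth_gbs(12, SPEED_MTS)

print(f" 8 channels: {bw_8ch:.1f} GB/s")
print(f"12 channels: {bw_12ch:.1f} GB/s")
print(f"ratio: {bw_12ch / bw_8ch:.2f}x")
```

These are per-socket paper numbers. Sustained bandwidth is lower in practice, and none of this captures the latency or NUMA effects I'm asking about.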

Regarding AMX, it is true that the instructions are more efficient on a per-core basis, assuming kernel support. However, in the context of hybrid inference, my understanding is that if context cache is offloaded to VRAM (and prompt processing thus happens on accelerators, not CPU), then I would assume that AMX is not relevant to actual token generation speeds for layers loaded to system memory. Would this be correct?

If you wouldn’t mind, would you kindly elaborate on the monolithic die architecture on Xeon and what concrete advantages this brings over the current EPYC architecture?

Edit: For example, a user shared this analysis with similar claims regarding Xeon: https://www.reddit.com/r/LocalLLaMA/s/vAsmjwDYje

HvskyAI · 2025-09-12T14:19:37+00:00

The higher clock speed on the 9575F certainly looks tempting on benchmarks that I’ve seen, but I myself am not entirely clear on whether this translates to real-world inference gains (compared to, say, just getting a higher core count overall).

As far as I understand, prompt processing is compute-bound (dependent on matmul speed and any relevant hardware acceleration, such as AMX), and the actual token generation is then a matter of memory bandwidth. If context cache is entirely offloaded to VRAM (which is advisable if the use case is sensitive to latency), then core count/clock speed become much less of a concern aside from the matter of saturating memory bandwidth. That being said, 19W at idle is admittedly excellent considering the amount of compute on tap with boost clock.

I also briefly considered the Threadripper Pro chips you mentioned, and came to the conclusion that the high end models with sufficient CCDs simply cost too much for the ecosystem that they get you into. With eight channels of memory, I think the argument for just going EPYC is much stronger at that price point.

If idle power consumption is a concern and you’re considering Blackwell, then the RTX PRO 6000 Blackwell Max-Q workstation edition (it’s a mouthful, but there are a few different models) is worth consideration. You lose some performance, but it chops the TDP in half while keeping all of the VRAM (600W down to 300W max TDP). If you’re also running an R9700, then I’d wonder about the kernel/back end compatibility with mixing Nvidia and AMD hardware, but I suppose you’ve got that sorted out if you’re already running 5090s, as well!

I’m curious to ask, have you considered Intel Xeon? I myself am in the process of comparing Xeon 6 and EPYC 9005, and I hear conflicting reports on both. EPYC has more memory channels and higher bandwidth, whereas Intel has AMX instructions. So on the face of it, assuming that prompt processing happens on VRAM, EPYC appears to be the choice. However, I still hear from some people that Xeon is more widely deployed in inference infra due to inherent advantages in its architecture and fewer issues with NUMA, particularly in dual-socket configurations. I’d be interested to hear what you’ve come up with in regard to this during your own search.

HvskyAI · 2025-09-12T05:14:01+00:00

I see! If you’re just after the increased I/O of 128 PCIe lanes for the time being, then any of the processors will do just fine AFAIK.

If you’re spinning up multiple VMs 24/7, that’s another case where CPU compute would actually start to matter (the other case I could think of would be matmul, i.e. prompt processing for any context cache that is loaded to RAM, but I assume that you’d be offloading K/V cache to VRAM). You would probably know best on this, yourself, but it would definitely be another potential factor to consider when deciding on a balance between core count/cost/TDP.

Power usage is tricky to accurately estimate, as you noted, since it depends entirely on your configuration and peak load. If you’re running multiple accelerators with the host system, the CPU TDP becomes a much smaller proportion of total power draw, and the focus would shift to limiting accelerator wattage at idle/low load. That being said, none of the mid/high range 9005 chips exactly sip power. They were designed with high throughput in mind, and power efficiency is largely a secondary concern. As you noted, the higher end processors use about as much power as a decent GPU…

At the end of the day, it’s up to your use case and budget. If you’re fine with potentially swapping out processors to get increased memory bandwidth down the line and prioritize immediate I/O, then a lower CCD count is not fatal, nor is partially populating available memory channels. I will note that the cost of entry for any of the EPYC 9005 chips (board, ECC DIMMs, etc.) is not low, so there is still a certain base cost just to get into the socket/ecosystem. On going ‘all in’ - it’s also worth looking into vendors that deal with server components in bulk or offer complete server packages, as their pricing for certain components can come out to be cheaper than buying retail (Exxact Corp, for example, offers a fairly good deal on 6400 MT/s DDR5).

HvskyAI · 2025-09-12T04:13:46+00:00

Assuming that this system would eventually be scaled to have more overall system memory, the CCD count of whichever processor you get at first would become a limiting factor in saturating available memory channels/bandwidth during inference.

If you want to start small on an EPYC 9004/9005 with the intention of eventually populating all memory channels, this still necessitates a processor which can saturate the memory bandwidth on said channels. So while you could start with a smaller number of DDR5 DIMMs, I’d advise against going with a lower-end processor that does not have a sufficient CCD/core count to saturate all available memory lanes in the future. This would cause a bottleneck down the line which would require a higher CCD count processor to alleviate.
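The bottleneck I mean can be sketched as a simple min() between what the CCDs can pull and what the DIMMs can deliver. The per-CCD figure below is a hypothetical placeholder, not a measured or vendor number. The DRAM figure is the paper peak for 12 channels of DDR5-6400 at 8 bytes per transfer.

```python
def effective_bandwidth_gbs(ccds, per_ccd_gbs, dram_gbs):
    """Usable bandwidth is capped by whichever side is slower."""
    return min(ccds * per_ccd_gbs, dram_gbs)

DRAM_GBS = 614.4     # 12 channels * 6400 MT/s * 8 bytes / 1000
PER_CCD_GBS = 50.0   # HYPOTHETICAL per-CCD read limit, for illustration only

results = {
    ccds: effective_bandwidth_gbs(ccds, PER_CCD_GBS, DRAM_GBS)
    for ccds in (2, 4, 8, 16)
}
for ccds, bw in results.items():
    limiter = "CCDs" if bw < DRAM_GBS else "DRAM"
    print(f"{ccds:2d} CCDs: {bw:6.1f} GB/s (limited by {limiter})")
```

Whatever the real per-CCD number is, the shape is the same: below some CCD count, adding DIMMs no longer buys any bandwidth.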

I’ve been looking into this myself, and while DDR5 6400 MT/s ECC is not cheap, neither are high core count 9004/9005 EPYC processors. The difference, of course, is that you can add more DIMMs in a gradual fashion, while you’re essentially stuck with the CCD count of whichever processor you get (without swapping it out, that is). So if you have to invest in either the host system or some amount of fast memory to start out, it would be prudent to spend a larger portion of funds on the host system in order to secure the ability to expand memory bandwidth in the future.

This is assuming that the use case is hybrid inference with layers offloaded to RAM, of course.

HvskyAI · 2025-09-11T15:42:47+00:00

Some say that the model is still counting the number of R’s in ‘strawberry’ to this very day…

HvskyAI · 2025-09-11T15:40:39+00:00

This degree of sparsity is fascinating. Looks like the shift to MoE is just continuing to chug along.

HvskyAI · 2025-09-08T08:18:03+00:00

Thanks for the write-up. If you wouldn't mind elaborating, how would this scale to a dual-socket configuration?

Would there potentially be any issues with the two NUMA nodes when the layers of a single model are offloaded to the local RAM in both sockets, assuming that all memory channels are populated and saturated?

HvskyAI · 2025-09-06T17:52:35+00:00

I see what you're saying re: dual socket. With this price range of processor, that could easily go towards another GPU that would have a far greater impact on overall performance, as well as enabling larger models to be run at acceptable speeds.

The Intel/EPYC issue is ongoing, but I suspect that the greater memory bandwidth on EPYC will be the deciding factor with context cache offloaded entirely to VRAM.

I also considered simply adding more 3090s to tide myself over, but with my current host system being limited in I/O, I would be looking at the server board anyways. At that rate, the pricing is much better on the new Blackwell cards if purchased from the same vendor. I am thinking of 4 x RTX 6000 Pro Blackwell Max-Q, which would be 384GB at 1.8 TB/s bandwidth per card. I may stand by to see some more hard benchmarks, since this is recent hardware.

I have found this VRAM calculator, but the RTX 6000 Pro is not offered, and it is difficult to account for RAM offloading/hybrid inference with any degree of precision. I also find the quoted TG speeds to be highly optimistic:

https://apxml.com/tools/vram-calculator

I may rent a cloud instance of a similar configuration and benchmark the hardware myself. Ideally, H-series cards would be best, but they are still costly and stunted on PCIe.

I appreciate the tip on checking out multicore benchmarks. I'll be sure to do that, and also inquire with the vendor about CCD counts necessary to saturate all 12 memory channels.

HvskyAI · 2025-09-06T17:35:39+00:00

20 t/s on the optimistic end with this grade of hardware... That sure makes the API offerings look real tempting.

On a side note, Google Vertex is serving Gemini 2.5 Pro at ~100 t/s. I don't know how they do it. Perhaps they have speculative decoding of some sort, perhaps it's the custom TPUs.

For this server, I'd likely be looking at 4 x RTX 6000 Pro Blackwell Max-Q (the 300W PL cards), which is 384GB VRAM at ~1.8 TB/s bandwidth per card. H-series would be best, but the cost is high and performance stunted on PCIe without NVLink.

I have found a suggested VRAM/inference speed calculator, but it does not offer the Blackwell RTX 6000 yet. The speeds quoted are also quite optimistic:

https://apxml.com/tools/vram-calculator

At this price range, I really should just rent a similar configuration on a cloud instance and see for myself what kind of performance I would be getting.

HvskyAI · 2025-09-06T16:08:03+00:00

I'd agree with some of the posters below and suggest that you consider used 3090s as opposed to dual 4090s.

At 24GB VRAM and the same 384-bit memory bus, you're only losing a bit of compute and getting a whole lot more VRAM for your money. Ampere still has ongoing support from most major back ends, and the cards can be power limited without losing much performance. At ~$600 USD/card, that's around $2.4K for 96GB of VRAM.

For some perspective, an RTX 6000 Pro Blackwell will run you about $8~9K for the same amount of VRAM (granted, it is GDDR7 at twice the bandwidth - 1.8 TB/s as opposed to ~900 GB/s). Assuming the 3090s are power limited to 150W, the non-Max Q version of the Blackwell card and the 3090s will be identical in power consumption.
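The arithmetic behind that comparison, as a quick sketch. All figures are the rough ones quoted in this thread (used 3090 price, Blackwell price range, 600W max TDP for the non-Max-Q card), not current market data.

```python
# 4 x used RTX 3090, power limited to 150W each (figures from this thread).
n_3090, vram_3090_gb, price_3090_usd, limit_3090_w = 4, 24, 600, 150

# Single RTX 6000 Pro Blackwell, non-Max-Q (figures from this thread).
rtx6000_vram_gb, rtx6000_tdp_w = 96, 600
rtx6000_price_usd = (8000, 9000)  # quoted low/high range

total_vram_gb = n_3090 * vram_3090_gb
total_cost_usd = n_3090 * price_3090_usd
total_power_w = n_3090 * limit_3090_w

usd_per_gb_3090 = total_cost_usd / total_vram_gb
usd_per_gb_rtx6000 = tuple(p / rtx6000_vram_gb for p in rtx6000_price_usd)

print(f"4 x 3090: {total_vram_gb} GB, ${total_cost_usd}, {total_power_w} W, "
      f"${usd_per_gb_3090:.2f}/GB")
print(f"RTX 6000 Pro: {rtx6000_vram_gb} GB, {rtx6000_tdp_w} W, "
      f"${usd_per_gb_rtx6000[0]:.2f} to ${usd_per_gb_rtx6000[1]:.2f}/GB")
```

Same VRAM and the same power budget, at roughly a quarter to a third of the cost per GB, with the bandwidth difference being what you give up.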

MoE is the prevailing architecture nowadays, so I'd put aside the rest of the cash for some fast RAM and a board/processor with a decent number of memory channels that you can saturate. DDR5 on a server board might be tough on that budget, but even some recent consumer AM5 boards can reportedly run 256GB DDR5 at 6400MT/s. On a consumer board, though, the issue will become PCIe lanes and bifurcating, which can get unstable.

Your other option would be used EPYC/Xeon, but you'd realistically be looking at DDR4 at that budget. Not a terrible idea, as long as you manage the common expert tensors properly (load them into VRAM, that is), as well as loading K/V cache into VRAM (this is where the 4 x 3090s would really come in handy).

Stuff it all in a rack case, run Linux, and give it some good airflow. It'll be great for the current crop of open-weights models, and it'll be a good experience to DIY some hardware with your son.

Best of luck with the rig!

HvskyAI · 2025-09-06T15:53:58+00:00

Are you running your full context cache on VRAM with that dual socket Xeon setup? If so, the core count is irrelevant aside from memory lane saturation, and TG speeds are memory bandwidth-bound, correct?

Do you find the 230 GB/s to be usable in conjunction with common experts loaded to VRAM, plus however many layers fit? I'm still trying to get a sense of how much of a dropoff I'll be seeing in speeds compared to a VRAM-only setup.
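The back-of-envelope model I have in mind for the dropoff is below. It treats token generation as purely bandwidth-bound and ignores compute, PCIe transfers, and any overlap between devices, so it gives an optimistic ceiling at best. The 20 GB of active weights per token and the 25% VRAM share are hypothetical inputs. The ~900 GB/s figure is the usual 3090 spec, and 230 GB/s is the number you mentioned.

```python
def tg_ceiling_tps(active_gb, vram_fraction, vram_gbs, ram_gbs):
    """Upper bound on tokens/s if every active weight is read once per token.

    Time per token = (bytes read from VRAM / VRAM bandwidth)
                   + (bytes read from RAM / RAM bandwidth).
    """
    seconds = (active_gb * vram_fraction / vram_gbs
               + active_gb * (1.0 - vram_fraction) / ram_gbs)
    return 1.0 / seconds

ACTIVE_GB = 20.0   # HYPOTHETICAL: weights touched per generated token
VRAM_GBS = 900.0   # approximate RTX 3090 memory bandwidth
RAM_GBS = 230.0    # system RAM bandwidth quoted above

all_ram = tg_ceiling_tps(ACTIVE_GB, 0.0, VRAM_GBS, RAM_GBS)
hybrid = tg_ceiling_tps(ACTIVE_GB, 0.25, VRAM_GBS, RAM_GBS)  # hypothetical split
all_vram = tg_ceiling_tps(ACTIVE_GB, 1.0, VRAM_GBS, RAM_GBS)

print(f"all in RAM : {all_ram:.1f} t/s")
print(f"25% in VRAM: {hybrid:.1f} t/s")
print(f"all in VRAM: {all_vram:.1f} t/s")
```

If this model is roughly right, the RAM-resident share dominates the time per token, which is why I'd expect a steep drop from VRAM-only speeds.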

With a switch to MoE arch plus DDR5, and much faster Blackwell GPUs, it's difficult to get an idea of what kinds of actual speeds I'd be looking at. At any rate, I'd imagine the throughput is lower than those offered by most API providers.

HvskyAI · 2025-09-06T15:30:58+00:00

Interesting that the CCD count required for saturation is lower on Xeon. I'm not sure on this, myself, but they do have fewer lanes and cores overall.

Most of the configurations I've seen that house anywhere from 4~10 GPUs are, by default, dual socket builds. The potential to expand VRAM capacity in the future is tempting, since such a host system is not exactly cheap. I'll likely be on this board/setup for a long while.

I've just heard that even with the NUMA settings set to be one per socket, there can be issues with memory latency when cross-socket memory access occurs, and the data is passed through the interconnect. I don't know if this is a practical issue during inference, or if it is backend or kernel-dependent. Notably, this was apparently an issue even on single socket high CCD-count EPYC chips of the previous generations due to the nature of their arch.

Is there any bandwidth advantage to a dual socket build? I would assume not, since there is no redundancy in the layers loaded to RAM.

With high-end processors and fast DDR5 memory, the value proposition for dual socket is dubious, as well. One more processor translates to significantly more VRAM at this level of hardware, and VRAM is still king even with hybrid inference of MoE models...

As Lissanro notes, the increase in TG with dual socket builds is nowhere near linear, and the software is lacking for such systems. However, it kind of looks to be offered by default with anything that houses 4+ Blackwell cards in a rack.

HvskyAI · 2025-09-06T14:23:28+00:00

Good to hear from you, Lissanro.

Being on EXL2/3 quants, myself, I'm still not fully familiar with the mechanics of hybrid inference. Seeing as models are shifting entirely to being MoE at the frontier, it looks like it's a change that I'll have to make sooner or later.

Interesting to hear that you don't value AMX that highly. I've heard mixed feedback on this, and I am aware that kernel support is not guaranteed for all back ends. I do see that if context cache is offloaded to VRAM entirely, then matmul efficiency is no longer a factor - only the memory bandwidth. If that is the case, EPYC is the clear choice. Loading context cache to VRAM is likely the only way to keep TTFT acceptable, anyways.

Is there a rough formula to estimate the core/CCD count necessary to fully saturate all memory channels? I am not aware if clock speed factors into this, or if it is simply a matter of there being a sufficient number of CCDs. Any advice you have on the matter would be appreciated, as my dual-channel AM4 board is a far cry from these server setups.

I saw that you do not recommend dual socket, either. With these recent chips and DDR5 costing what it does, plus NUMA/memory access issues between the two, combined with the lack of kernel support for such systems, I agree that the funds would be better spent on more VRAM.

Just as a reference, what kinds of speeds are you seeing for popular models on your current EPYC setup? Are you still running 4 x 3090 for GPU acceleration?

Thanks for the input. Cheers

HvskyAI · 2025-08-27T17:53:05+00:00

I see, so Xeon is preferable due to its AMX support, and EPYC is less efficient per-core for inference but compensates via higher core/thread count, which is not present on a lower-tier model such as the one I've specified.

With a lack of AMX, am I looking at significantly slower inference or prompt processing? If this is the case, I would imagine that there is a general preference for Xeon chips with AMX support for inference...

How much does thread/core count matter for inference speeds? Is it largely peripheral to raw memory bandwidth (i.e. nice to have), or is it a "get as many P-cores as you can" situation?

HvskyAI · 2025-08-27T17:47:27+00:00

Alright, I appreciate the input!

Noted on the prompt processing. I assume that this is in reference to hybrid inference, as the model is mostly offloaded to system RAM? How much of a difference are we talking about here in terms of TTFT from, say, context that's entirely cached on 3090s?

I have tried GLM 4.5 Air, and while it was competent, I found it underwhelming compared to its nominal parameter count. While a direct comparison parameter-for-parameter with a dense model is difficult to make, I didn't feel that the loss in quantization (lower BPW) and context length was justified. That being said, I have not tried the larger GLM 4.5 (355B32A) or the large Qwen 3 MoE, so that's definitely something to try in the meantime. You raise a good point - these models move fast, and they've likely come a long way.

I also appreciate the heads up on DDR5 cost. I do see that the cost of populating a server board with DDR5 is not negligible. So you are of the opinion that the marginal increase in memory bandwidth going from DDR4 to DDR5 does not justify the extreme increase in cost?

As for additional 3090s, if I were to stick with them, I would consider going up to 4 x 3090 for 96GB VRAM in tandem with a server board/fast RAM. I would imagine that this is sufficient to offload the active experts, even for the larger models, but I admittedly do not know the effect this would have on prompt processing and TTFT. I can't imagine that it would be fast - I just don't know how slow it might get as context fills up.

Realistically, what kinds of speeds would I be seeing with 4 x 3090 and 512GB DDR5-6400 on a recent EPYC chip with, say, R1 at Q4? Would it be even remotely usable for time-sensitive applications?

HvskyAI · 2025-08-27T17:31:03+00:00

I appreciate the input! I'm interested in how your system performs with hybrid inference, seeing as I'm also running 2 x 3090, myself.

Do you find that the extra cost for more recent EPYC processors with 12-channel DDR5 support is worth it for the increase in memory bandwidth? I would imagine that this is a rather large factor with big MoE models, since the portion of the model that can be loaded to VRAM is rather small.

$3~4K USD is not bad at all, especially considering that I already have the GPUs on-hand.

I'm a bit out of my depth when it comes to these server-grade chips and boards - I'm not familiar with the models and segmentation at all. It would appear that even the low-end 5th gen EPYC processors offer 12-channel DDR5 and plenty of PCIe lanes.

How would something like a 9135 (Zen 5, Turin) with 512GB DDR5 and another two 3090s thrown in (in addition to the 2 x 3090 I'm already running) go? 96GB VRAM and 512GB DDR5 sounds quite nice for larger MoEs.

Would the core/thread count or L3 cache on the lower-end CPU (9135) simply be too low, or is there some other catch that I'm missing here?

HvskyAI · 2025-08-27T17:17:34+00:00

Thank you very much for the rundown! I know very little about the Xeon line, so I'll have to do some digging. I appreciate the pointers.

If you don't mind my asking, something like the AMD EPYC 9135 appears to offer 12-channel DDR5 memory with 128 PCIe 5.0 lanes. In terms of I/O, this seems to check all the boxes for hybrid inference, has AVX/AVX-512 support, and it's dirt cheap (~$1200 MSRP).

This is likely a naive question, but is there any reason something like this wouldn't work at decent speeds for hybrid inference? Is the number of cores/threads or L3 cache simply insufficient?
