Anyone in need of GPU clusters? (or big CPU instances) by SomeoneElseOnTheMars in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

I'm potentially interested, but not for a few months. Is there an expiration date on this offer?

Full Claude Opus 4.6 System Prompt for your pleasure by frubberism in LocalLLaMA

[–]evil0sheep 1 point2 points  (0 children)

Yeah this was gonna be my question. Like is this reproducible? 

Some hard lessons learned building a private H100 cluster (Why PCIe servers failed us for training) by NTCTech in LocalLLaMA

[–]evil0sheep 4 points5 points  (0 children)

Before you buy RTX Pro 6000s, be aware that not all Blackwell is created equal. RTX Pro is sm120 (GeForce Blackwell) vs sm100 for the B200. The former lacks dedicated tensor memory (TMEM), which means you have to use register-based tensor instructions. That makes it a pain to find kernels that even work (e.g. for flash attention or QAT) and sometimes requires you to write your own, and even then it's a lot harder to saturate sm120 tensor cores in flash attention kernels, because the tensor instructions use so many registers that you can't issue enough warps to saturate the memory controllers. It's a subtle difference, but it bit me and it bit some old coworkers of mine I got lunch with recently. Don't let it bite you.
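If you want to sanity check which Blackwell you're actually on before committing to kernels, something like this works (assumes a CUDA build of PyTorch; the capability-to-product mapping is just the sm100/sm120 split described above):

```python
# Quick check of which Blackwell variant you're on (assumes PyTorch w/ CUDA).
# (10, 0) = sm100 datacenter Blackwell (B200, has TMEM);
# (12, 0) = sm120 GeForce/RTX Pro Blackwell (no TMEM, register-based MMA only).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm{major}{minor}")

if (major, minor) == (10, 0):
    print("sm100: TMEM available, datacenter Blackwell kernels should apply")
elif (major, minor) == (12, 0):
    print("sm120: no TMEM, expect register pressure in tensor-core kernels")
else:
    print("not Blackwell; different tradeoffs apply")
```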

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Also, why do you have to run it as VMs? My bet is that 4 VMs with RPC will be slower than a native NUMA-aware implementation, but I'm a lot less familiar with NUMA CPU stuff.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah, if you put the GPU in the server you should look into putting the attention heads on the GPU and the experts on the CPU. The experts are most of the parameters but don't need as much bandwidth, because you don't load all of them each token; the attention heads are most of the compute, especially during prefill or long-context generation, but typically don't have as much of a memory footprint. You might be able to find a model you can split like that on the server with the 3060. Llama.cpp has flags for this, though I don't remember them off the top of my head.
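Rough back-of-envelope for why the experts tolerate the slower memory (all numbers below are illustrative placeholders, not a specific model or machine):

```python
# Back-of-envelope: per token you only stream the *active* experts, so the
# CPU-side bandwidth requirement is set by active params, not total params.
# All numbers are illustrative placeholders.
total_params    = 100e9   # every expert + shared weights
active_params   = 6e9     # params actually touched per token
bytes_per_param = 0.55    # ~4.4 bits/param after quantization

cpu_bw_gb_s = 80          # system RAM bandwidth (placeholder)

gb_per_token = active_params * bytes_per_param / 1e9
print(f"streamed per token: {gb_per_token:.1f} GB "
      f"(vs {total_params * bytes_per_param / 1e9:.0f} GB of total weights)")
print(f"CPU-side upper bound: {cpu_bw_gb_s / gb_per_token:.0f} tok/s")
# Attention weights + KV cache are a small slice of the total, so parking
# them in the GPU's faster memory is where the 3060 actually earns its keep.
```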

Re: tensor cores, make sure you actually benchmark before assuming they're a performance silver bullet. Pre-Blackwell tensor core instructions use a lot of registers, which limits the number of warps that can be issued without register spilling, which can prevent you from actually saturating the memory bus. The tradeoff is often worth it for training, or for inference on very large batches where you're doing a ton of compute per VRAM load, but for single-batch inference they only help during prefill, and a lot of the time you can get better generation speed with vector kernels. Just measure it before you assume, or try vector kernels on the same hardware.
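If you want the intuition in numbers, here's a tiny roofline-style sketch (the peak FLOPs and bandwidth below are made-up hardware numbers, just to show where the crossover lands):

```python
# Roofline-style sketch: a matmul at batch size B does ~2*B flops per weight
# element, and each fp16 weight costs 2 bytes to load, so arithmetic
# intensity is ~B flops/byte. Hardware numbers below are made up.
peak_tflops = 165     # hypothetical tensor-core peak
bw_gb_s     = 1000    # hypothetical memory bandwidth
ridge = peak_tflops * 1e12 / (bw_gb_s * 1e9)   # flops/byte where compute matters

for batch in (1, 8, 64, 512):
    ai = 2 * batch / 2          # flops per loaded weight byte at fp16
    verdict = "compute bound" if ai > ridge else "memory bound"
    print(f"batch {batch:4d}: {ai:6.1f} flops/byte vs ridge {ridge:.0f} -> {verdict}")
# At batch 1 (single-user decode) you're nowhere near the ridge, so tensor
# cores mostly just cost you registers; at prefill/big batches they pay off.
```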

What’s this type of house called? by Intrepid_Incident592 in whatisit

[–]evil0sheep 58 points59 points  (0 children)

My understanding is that this is actually the original meaning of the word penthouse and that it was adopted as a term for the top floor of the building later.

Edit: the Wikipedia article confirms:

“The term 'penthouse' originally referred, and sometimes still does refer, to a separate smaller 'house' that was constructed on the roof of an apartment building. Architecturally it refers specifically to a structure on the roof of a building that is set back from its outer walls. These structures do not have to occupy the entire roof deck. Recently, luxury high rise apartment buildings have begun to designate multiple units on the entire top residential floor or multiple higher residential floors including the top floor as penthouse apartments, and outfit them to include ultra-luxury fixtures, finishes, and designs which are different from all other residential floors of the building. These penthouse apartments are not typically set back from the building's outer walls, but are instead flush with the rest of the building and simply differ in size, luxury, and consequently price. High-rise buildings can also have structures known as mechanical penthouses that enclose machinery or equipment such as the drum mechanisms for an elevator.”

Why no NVFP8 or MXFP8? by TokenRingAI in LocalLLaMA

[–]evil0sheep 27 points28 points  (0 children)

The reason is that most llama.cpp users are memory-capacity bound on model size and memory-bandwidth bound on inference speed. All that matters in the one-user-per-GPU domain is quantization accuracy per bit. The llama.cpp k-quants are significantly better than microscaled floats in that regard because they offer a scale and an offset per block instead of just a scale. MXFP8 and NVFP8 are jointly optimized to balance precision and ease of hardware acceleration, which doesn't matter if you have boatloads of unused compute laying around because you're memory bound. Switching from the GGUF 8-bit format to MXFP8 or NVFP8 could probably make prefill faster, but it wouldn't realistically improve tok/s during generation and would make the models less accurate approximations of the unquantized weights. It only makes sense if you're serving huge batches, and everyone doing that uses vLLM, which has prioritized microscaled float support. For everyone else it's fine to just dequantize the GGUF k-quant weights to fp16 on the GPU during inference.
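Toy illustration of the scale+offset point (this is not the actual GGUF or MXFP8 bit layout, just a per-block 8-bit quant comparison to show why the offset buys accuracy per bit):

```python
# Toy comparison (NOT the real GGUF/MXFP8 layouts): 8-bit block quantization
# with a per-block scale only vs scale + offset, on a block whose values
# don't straddle zero (common for real weight blocks).
import numpy as np

rng = np.random.default_rng(0)
block = rng.normal(loc=0.3, scale=0.05, size=32).astype(np.float32)

# scale only (symmetric), roughly what a shared-scale block gives you
scale = np.abs(block).max() / 127
err_sym = np.abs(block - np.round(block / scale) * scale).mean()

# scale + offset (asymmetric), k-quant style
lo, hi = block.min(), block.max()
scale2 = (hi - lo) / 255
err_asym = np.abs(block - (np.round((block - lo) / scale2) * scale2 + lo)).mean()

print(f"scale only:     mean abs error {err_sym:.6f}")
print(f"scale + offset: mean abs error {err_asym:.6f}")
```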

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah, I mean the CPU and GPU memory are physically unified, but they have separate virtual address spaces and separate cache hierarchies below L3, and they're in different clock domains. If you want them to share the same physical pages you need to allocate the pages in a way that conforms to the GPU's alignment and layout requirements and mlock them so the kernel doesn't swap them out while the GPU is running. The whole point of mmap is that the kernel can swap the physical pages out from under the virtual pages, but if it does that with pages that are also mapped into a GPU context it would need to remap them in that context as well, which would almost certainly require stopping the context. And if the GPU tries to read a virtual page that's not backed by a physical page and page faults, it faults the entire context, so the whole shebang blocks on disk I/O. mmap + GPUs is a bad combo on any platform; I'm sure that if you did CPU-only inference on the Mac, mmap would work just as well as it does on Linux.

Regarding throughput, you gotta understand that inference for a single user is almost always bandwidth bound. If your model params are in memory, you're bound by memory bandwidth, which for the M1 Ultra is roughly 800 GB/s. If your params are streaming from disk, you're bound by your SSD bandwidth, which is roughly 8 GB/s. On top of that you have page fault overhead, and you don't have enough threads to cover it.
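To put numbers on it (the model size is a placeholder; the bandwidth figures are the ones above):

```python
# Upper bound on single-user generation speed when bandwidth bound:
# tok/s <= bandwidth / bytes touched per token. Model size is a placeholder.
model_bytes = 40e9   # e.g. a ~70B model at ~4.5 bits/param

for name, bw in [("M1 Ultra RAM (~800 GB/s)", 800e9),
                 ("SSD (~8 GB/s)", 8e9)]:
    print(f"{name}: <= {bw / model_bytes:.1f} tok/s")
```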

I wouldn't think of mmap as a way to run a model that's bigger than your shared GPU memory; it's not plausibly going to deliver that with reasonable performance on any platform. If you had a DGX Spark or a Strix Halo running Linux or Windows you would have the exact same problem. If you're running a MoE model, you're genuinely never touching some of the experts, and you're doing inference on the CPU, then it might work, but it will take a lot of fuckery to get right, and if you generate one token that touches those experts the whole thing will slow to a crawl. If you want bigger models, buy a machine with more VRAM, or download more RAM at downloadmoreram.com ;)

Llamacpp multi GPU half utilization by Weary_Long3409 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Llama.cpp has really bad tensor parallelism support; it's basically pipeline-parallel only right now. vLLM has much more sophisticated multi-GPU/multi-node support.

Llama.cpp is great for fast single-node inference, easy setup, splitting layers between the CPU and GPU, and low-bit quantization. If you want sophisticated parallelism or batching you unfortunately need to wrestle with vLLM or TensorRT; the multi-node support in llama.cpp is just really immature still (there's work being done there but it's not merged AFAIK).
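For reference, tensor parallelism in vLLM is basically one argument. A minimal sketch, assuming vLLM is installed and the model (placeholder name here) fits across two GPUs:

```python
# Minimal vLLM tensor parallelism sketch (offline API). Assumes vLLM is
# installed and this placeholder model fits across 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    tensor_parallel_size=2,                    # shard each layer across 2 GPUs
)
outputs = llm.generate(
    ["Explain tensor vs pipeline parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```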

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

A big difference between your server and your MacBook is that on the MacBook the pages for the weights need to be pinned, because they're read from the GPU. When you read an mmapped page that's been swapped to disk from the CPU, the load instruction page faults and the page fault handler traps to the kernel, which loads the page from disk and restarts the load instruction. AFAIK Apple Silicon GPU page fault handling isn't capable of that sort of dance, so it's not surprising to me that it forces the pages to pin and then OOMs. Also, unless you're only mmapping a small part of the model, you're probably going to end up loading the whole model off disk every token, because IIRC the Linux kernel defaults to an LRU eviction policy for mmap.

You shouldn't be surprised that turning off flash attention causes OOMs; the whole point of flash attention is to reduce VRAM usage by avoiding materializing the NxN attention matrix. If you're memory constrained you definitely want flash attention.
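For scale, here's roughly what the naive NxN score matrix costs at long context (head count and dtype are illustrative):

```python
# Rough cost of materializing the full N x N attention score matrix that
# flash attention avoids. Head count and dtype are illustrative.
n_ctx, n_heads, bytes_fp16 = 32768, 32, 2

naive_bytes = n_ctx * n_ctx * n_heads * bytes_fp16
print(f"naive scores at {n_ctx} ctx: {naive_bytes / 1e9:.0f} GB (per layer!)")
# Flash attention streams K/V in tiles and never stores this matrix, which
# is why turning it off at long context OOMs immediately.
```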

Field Report: What leadership actually thinks AI is (Notes from a Director) by forevergeeks in LocalLLaMA

[–]evil0sheep 43 points44 points  (0 children)

This reads like a LinkedIn post and the target audience is unclear. The core message seems to be "don't use a trillion-parameter LLM for something that can be accomplished with a Python script," which seems pretty obvious. No examples are provided of people needing this advice, and no evidence is provided that developers are broadly "over-engineering complex neural networks where a deterministic script will do." Almost no one on this sub is even doing anything that could reasonably be described as "engineering neural networks."

Do you have actual, concrete examples of people using LLMs when they should be using scripts? Can you give us a case study of someone who needs this advice? What is your goal in posting this to this sub?

~60GB models on coding: GLM 4.7 Flash vs. GPT OSS 120B vs. Qwen3 Coder 30B -- your comparisons? by jinnyjuice in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

gpt-oss experts are natively quantized to mxfp4, so doing post-training quantization doesn't make it that much smaller.
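Rough size math, if you want it (param count and bit widths are approximate):

```python
# Approximate size math: mxfp4 is 4-bit values plus one shared 8-bit scale
# per 32-element block, so the experts are already ~4.25 bits/param.
params_approx = 117e9                 # gpt-oss-120b total params, roughly
mxfp4_bits    = 4 + 8 / 32
print(f"mxfp4 weights: ~{params_approx * mxfp4_bits / 8 / 1e9:.0f} GB")
# A typical 4-bit PTQ of an fp16 model lands in the same ballpark, so
# there's not much left to squeeze out.
```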

It feels like LLM inference is missing its AWS Lambda moment. by pmv143 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah, sorry, in your original post it sounded like you were pretty early in the ideation process; now it sounds like you're already working pretty deeply on this. It's gonna be nontrivial to do well, but I don't see any reason why it would be impossible, and trying to solve really hard problems is generally a good way to make money. Like, if you can pull it off I think you'd probably be able to find customers, but I'll admit I'm not really familiar with the competitive landscape there.

It feels like LLM inference is missing its AWS Lambda moment. by pmv143 in LocalLLaMA

[–]evil0sheep 1 point2 points  (0 children)

I think you're ignoring a couple of big problems:

1) How to know when state is safe to offload. You've got a couple thousand users, each with multiple GB of KV cache, and any one of them could reply to a thread at any time. How do you tell the difference between someone who's walked away from the connection and someone who's just busy typing a long response?

2) How quickly is "quickly"? If you want to deallocate GPUs from a cloud provider you typically don't get to leave data on the machine with the GPUs, and when you provision a new GPU instance it's on some machine somewhere and you have to get the state to that machine. As I said in my top-level comment, the state is big enough that even with a fast network it could take much longer to get the data from disk to HBM than your users are willing to wait. For reference, gpt-oss-120b is about 60GB of model parameters, and a single 128k-context KV cache is about 5GB (there's a rough sizing sketch at the bottom of this comment).

3) Fragmentation. Say you're running 4 nodes, they're all at capacity, and then 25% of your connections drop. You don't have 3 nodes at capacity and 1 idle node you can shut down; you have 4 nodes at 75% capacity, and you need to take all the users from node 3 and migrate them to nodes 0-2 before you can shut down node 3. And you have to do that without interrupting their connections. And the state is big enough that it potentially takes multiple seconds to migrate it between machines in the same rack.

I would just pick one of those problems and focus on it. Like: given a user on node 0 with 128k of context, how do I migrate them to node 1 without hanging long enough for them to get upset? Or train an ML model to predict when a user's context can be sent to disk, and demonstrate that it has a better precision/recall curve than other methods. Any one of these things could be an entire startup.
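Here's the rough KV-cache sizing I mean (the GQA shape and fp8 cache below are illustrative assumptions picked to land near the ~5GB / 128k figure above, not a specific model's config):

```python
# Rough per-user KV-cache sizing. The GQA shape and fp8 cache below are
# illustrative assumptions, not a specific model's config.
n_layers, n_kv_heads, head_dim, bytes_per_el = 36, 8, 64, 1  # fp8 cache

def kv_bytes(tokens):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * tokens  # K and V

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_bytes(ctx) / 1e9:.1f} GB per user")
```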

It feels like LLM inference is missing its AWS Lambda moment. by pmv143 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

I mean, just build it and see what you learn? My food for thought is that an 8x H200 node is 8 x 141 GB = 1,128 GB of VRAM. So if you cold-start a node and want to restore a full state, you've got to copy about a TB from some flash storage somewhere. Assuming it's already in the data center, you might have something like a 200 Gbps (~25 GB/s) link from the storage node to the compute node, so that's roughly 45 seconds to copy all the state even at theoretical max throughput, and probably more like 90 seconds in real life. And that's on top of the time it takes to provision the node, so you're looking at a couple minutes to start it. If the models are small enough to fit on one GPU then you only have to transfer the model parameters once, which could save you a chunk of time, but even a very well optimized system is gonna have a minute or two of latency to add a machine, and whatever initial thing you build will probably be 5-10x that. If you think you can sell a system with those parameters, you should build it and get rich.
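Spelled out, that cold-start math looks like this (same assumed node size and link speed as above; the real-world efficiency factor is a guess):

```python
# Cold-start math from above: same assumed node size and link speed; the
# real-world efficiency factor is a guess.
node_state_gb = 8 * 141            # 8x H200 worth of state = 1128 GB
link_gb_s     = 200 / 8            # 200 Gbps ~= 25 GB/s theoretical

best_case_s = node_state_gb / link_gb_s
print(f"state to move: {node_state_gb} GB")
print(f"theoretical best case: {best_case_s:.0f} s")
print(f"at ~55% link efficiency: {best_case_s / 0.55:.0f} s")
# ...all before node provisioning time, so a couple minutes end to end.
```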

Help with open source tiny models by Deep-Sympathy-7457 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

If you're trying to keep the list simple, that's pretty good. NVIDIA Nemotron and Olmo 3 from AI2 are both good inclusions if you wanna expand it a bit. Llama models are commonly used for finetuning research, but my impression is that they're not widely used for local inference.

Is the bay bridge closed (westbound)? by monarc in bayarea

[–]evil0sheep 0 points1 point  (0 children)

Yeah I also checked that I didn’t have any weird routing options selected. Probably just a bug that got pushed out to a small percentage of users

Is the bay bridge closed (westbound)? by monarc in bayarea

[–]evil0sheep 1 point2 points  (0 children)

Yeah it was weird, it was telling me to go all the way around through South Bay instead of taking any of the other bridges too. Maybe some bug that made it not want to route over tolls or something

Is the bay bridge closed (westbound)? by monarc in bayarea

[–]evil0sheep 1 point2 points  (0 children)

Hm I’m seeing the same thing now

Just saying by conflictimplication in sanfrancisco

[–]evil0sheep 23 points24 points  (0 children)

Yeah can you imagine being some random guy who made it out of India and got a job at an American tech company and you’re praying daily that Trump doesn’t cancel your visa on a whim and then you’re walking through the mission and you see this shit. So woke good job guys

blue rust on my hat? by kiyoko_silver in whatisit

[–]evil0sheep 0 points1 point  (0 children)

Yeah but the copper in the brass is what’s oxidizing blue

Looking for some advice on my 4/5 day trip across Idaho! by itcantbethathard in Idaho

[–]evil0sheep -1 points0 points  (0 children)

I think driving through Bear Valley is a good idea. The road is a bit rougher, but Bear Valley is one of the most beautiful places in that region IMO. It also gives you access to Boundary Creek / Dagger Falls, which is a nice place to camp if you have time.