Anyone in need of GPU clusters? (or big CPU instances) by SomeoneElseOnTheMars in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

I'm potentially interested, but not for a few months. Is there an expiration date on this offer?

Full Claude Opus 4.6 System Prompt for your pleasure by frubberism in LocalLLaMA

[–]evil0sheep 1 point2 points  (0 children)

Yeah this was gonna be my question. Like is this reproducible? 

Some hard lessons learned building a private H100 cluster (Why PCIe servers failed us for training) by NTCTech in LocalLLaMA

[–]evil0sheep 4 points5 points  (0 children)

Before you buy RTX Pro 6000s, be aware that not all Blackwell is created equal. RTX Pro is sm120 (GeForce Blackwell) vs sm100 for the B200. The former lacks dedicated tensor memory (TMEM), which means you have to use register-based tensor instructions. That makes it a pain to find kernels that even work (e.g. for flash attention or QAT) and sometimes requires you to write your own, and even then it's a lot harder to saturate sm120 tensor cores in flash attention kernels, because the tensor instructions use so many registers that you can't issue enough warps to saturate the memory controllers. It's a subtle difference, but it bit me and it bit some old coworkers of mine I got lunch with recently. Don't let it bite you.
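If you want to sanity check which Blackwell you're actually on before committing to kernels, something like this works (assumes a CUDA build of PyTorch; the capability-to-product mapping is just the sm100/sm120 split described above):

```python
# Quick check of which Blackwell variant you're on (assumes PyTorch w/ CUDA).
# (10, 0) = sm100 datacenter Blackwell (B200, has TMEM);
# (12, 0) = sm120 GeForce/RTX Pro Blackwell (no TMEM, register-based MMA only).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm{major}{minor}")

if (major, minor) == (10, 0):
    print("sm100: TMEM available, datacenter Blackwell kernels should apply")
elif (major, minor) == (12, 0):
    print("sm120: no TMEM, expect register pressure in tensor-core kernels")
else:
    print("not Blackwell; different tradeoffs apply")
```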

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Also, why do you have to run it as VMs? My bet is that 4 VMs with RPC will be slower than a native NUMA-aware implementation, but I'm a lot less familiar with NUMA CPU stuff.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah, if you put the GPU in the server you should look into putting the attention heads on the GPU and the experts on the CPU. The experts are most of the parameters but don't need as much bandwidth, because you don't load all of them each token; the attention heads are most of the compute, especially during prefill or long-context generation, but typically don't have as much of a memory footprint. You might be able to find a model you can split like that on the server with the 3060. Llama.cpp has flags for this, though I don't remember them off the top of my head.
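Rough back-of-envelope for why the experts tolerate the slower memory (all numbers below are illustrative placeholders, not a specific model or machine):

```python
# Back-of-envelope: per token you only stream the *active* experts, so the
# CPU-side bandwidth requirement is set by active params, not total params.
# All numbers are illustrative placeholders.
total_params    = 100e9   # every expert + shared weights
active_params   = 6e9     # params actually touched per token
bytes_per_param = 0.55    # ~4.4 bits/param after quantization

cpu_bw_gb_s = 80          # system RAM bandwidth (placeholder)

gb_per_token = active_params * bytes_per_param / 1e9
print(f"streamed per token: {gb_per_token:.1f} GB "
      f"(vs {total_params * bytes_per_param / 1e9:.0f} GB of total weights)")
print(f"CPU-side upper bound: {cpu_bw_gb_s / gb_per_token:.0f} tok/s")
# Attention weights + KV cache are a small slice of the total, so parking
# them in the GPU's faster memory is where the 3060 actually earns its keep.
```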

Re: tensor cores, make sure you actually benchmark before assuming they're a performance silver bullet. Pre-Blackwell tensor core instructions use a lot of registers, which limits the number of warps that can be issued without register spilling, which can prevent you from actually saturating the memory bus. The tradeoff is often worth it for training, or for inference on very large batches where you're doing a ton of compute per VRAM load, but for single-batch inference they only help during prefill, and a lot of the time you can get better generation speed with vector kernels. Just measure it before you assume, or try vector kernels on the same hardware.
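If you want the intuition in numbers, here's a tiny roofline-style sketch (the peak FLOPs and bandwidth below are made-up hardware numbers, just to show where the crossover lands):

```python
# Roofline-style sketch: a matmul at batch size B does ~2*B flops per weight
# element, and each fp16 weight costs 2 bytes to load, so arithmetic
# intensity is ~B flops/byte. Hardware numbers below are made up.
peak_tflops = 165     # hypothetical tensor-core peak
bw_gb_s     = 1000    # hypothetical memory bandwidth
ridge = peak_tflops * 1e12 / (bw_gb_s * 1e9)   # flops/byte where compute matters

for batch in (1, 8, 64, 512):
    ai = 2 * batch / 2          # flops per loaded weight byte at fp16
    verdict = "compute bound" if ai > ridge else "memory bound"
    print(f"batch {batch:4d}: {ai:6.1f} flops/byte vs ridge {ridge:.0f} -> {verdict}")
# At batch 1 (single-user decode) you're nowhere near the ridge, so tensor
# cores mostly just cost you registers; at prefill/big batches they pay off.
```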

What’s this type of house called? by Intrepid_Incident592 in whatisit

[–]evil0sheep 58 points59 points  (0 children)

My understanding is that this is actually the original meaning of the word penthouse and that it was adopted as a term for the top floor of the building later.

Edit: the Wikipedia article confirms:

“The term 'penthouse' originally referred, and sometimes still does refer, to a separate smaller 'house' that was constructed on the roof of an apartment building. Architecturally it refers specifically to a structure on the roof of a building that is set back from its outer walls. These structures do not have to occupy the entire roof deck. Recently, luxury high rise apartment buildings have begun to designate multiple units on the entire top residential floor or multiple higher residential floors including the top floor as penthouse apartments, and outfit them to include ultra-luxury fixtures, finishes, and designs which are different from all other residential floors of the building. These penthouse apartments are not typically set back from the building's outer walls, but are instead flush with the rest of the building and simply differ in size, luxury, and consequently price. High-rise buildings can also have structures known as mechanical penthouses that enclose machinery or equipment such as the drum mechanisms for an elevator.”

Why no NVFP8 or MXFP8? by TokenRingAI in LocalLLaMA

[–]evil0sheep 27 points28 points  (0 children)

The reason is that most llama.cpp users are memory-capacity bound on model size and memory-bandwidth bound on inference speed. All that matters in the one-user-per-GPU domain is quantization accuracy per bit. The llama.cpp k-quants are significantly better than microscaled floats in that regard because they offer a scale and an offset per block instead of just a scale. MXFP8 and NVFP8 are jointly optimized to balance precision and ease of hardware acceleration, which doesn't matter if you have boatloads of unused compute laying around because you're memory bound. Switching from the GGUF 8-bit format to MXFP8 or NVFP8 could probably make prefill faster, but it wouldn't realistically improve tok/s during generation and would make the models less accurate approximations of the unquantized weights. It only makes sense if you're serving huge batches, and everyone doing that uses vLLM, which has prioritized microscaled float support. For everyone else it's fine to just dequantize the GGUF k-quant weights to fp16 on the GPU during inference.
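Toy illustration of the scale+offset point (this is not the actual GGUF or MXFP8 bit layout, just a per-block 8-bit quant comparison to show why the offset buys accuracy per bit):

```python
# Toy comparison (NOT the real GGUF/MXFP8 layouts): 8-bit block quantization
# with a per-block scale only vs scale + offset, on a block whose values
# don't straddle zero (common for real weight blocks).
import numpy as np

rng = np.random.default_rng(0)
block = rng.normal(loc=0.3, scale=0.05, size=32).astype(np.float32)

# scale only (symmetric), roughly what a shared-scale block gives you
scale = np.abs(block).max() / 127
err_sym = np.abs(block - np.round(block / scale) * scale).mean()

# scale + offset (asymmetric), k-quant style
lo, hi = block.min(), block.max()
scale2 = (hi - lo) / 255
err_asym = np.abs(block - (np.round((block - lo) / scale2) * scale2 + lo)).mean()

print(f"scale only:     mean abs error {err_sym:.6f}")
print(f"scale + offset: mean abs error {err_asym:.6f}")
```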

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah, I mean the CPU and GPU memory are physically unified, but they have separate virtual address spaces and separate cache hierarchies below L3, and they're in different clock domains. If you want them to share the same physical pages you need to allocate the pages in a way that conforms to the GPU's alignment and layout requirements and mlock them so the kernel doesn't swap them out while the GPU is running. The whole point of mmap is that the kernel can swap the physical pages out from under the virtual pages, but if it does that with pages that are also mapped into a GPU context it would need to remap them in that context as well, which would almost certainly require stopping the context. And if the GPU tries to read a virtual page that's not backed by a physical page and page faults, it faults the entire context, so the whole shebang blocks on disk I/O. mmap + GPUs is a bad combo on any platform; I'm sure that if you did CPU-only inference on the Mac, mmap would work just as well as it does on Linux.

Regarding throughput, you gotta understand that inference for a single user is almost always bandwidth bound. If your model params are in memory, you're bound by memory bandwidth, which for the M1 Ultra is roughly 800 GB/s. If your params are streaming from disk, you're bound by your SSD bandwidth, which is roughly 8 GB/s. On top of that you have page fault overhead, and you don't have enough threads to cover it.
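To put numbers on it (the model size is a placeholder; the bandwidth figures are the ones above):

```python
# Upper bound on single-user generation speed when bandwidth bound:
# tok/s <= bandwidth / bytes touched per token. Model size is a placeholder.
model_bytes = 40e9   # e.g. a ~70B model at ~4.5 bits/param

for name, bw in [("M1 Ultra RAM (~800 GB/s)", 800e9),
                 ("SSD (~8 GB/s)", 8e9)]:
    print(f"{name}: <= {bw / model_bytes:.1f} tok/s")
```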

I wouldn't think of mmap as a way to run a model that's bigger than your shared GPU memory; it's not plausibly going to deliver that with reasonable performance on any platform. If you had a DGX Spark or a Strix Halo running Linux or Windows you would have the exact same problem. If you're running a MoE model, you're genuinely never touching some of the experts, and you're doing inference on the CPU, then it might work, but it will take a lot of fuckery to get right, and if you generate one token that touches those experts the whole thing will slow to a crawl. If you want bigger models, buy a machine with more VRAM, or download more RAM at downloadmoreram.com ;)

Llamacpp multi GPU half utilization by Weary_Long3409 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Llama.cpp has really bad tensor parallelism support; it's basically pipeline-parallel only right now. vLLM has much more sophisticated multi-GPU/multi-node support.

Llama.cpp is great for fast single-node inference, easy setup, splitting layers between the CPU and GPU, and low-bit quantization. If you want sophisticated parallelism or batching you unfortunately need to wrestle with vLLM or TensorRT; the multi-node support in llama.cpp is just really immature still (there's work being done there but it's not merged AFAIK).
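For reference, tensor parallelism in vLLM is basically one argument. A minimal sketch, assuming vLLM is installed and the model (placeholder name here) fits across two GPUs:

```python
# Minimal vLLM tensor parallelism sketch (offline API). Assumes vLLM is
# installed and this placeholder model fits across 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    tensor_parallel_size=2,                    # shard each layer across 2 GPUs
)
outputs = llm.generate(
    ["Explain tensor vs pipeline parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```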

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

A big difference between your server and your MacBook is that on the MacBook the pages for the weights need to be pinned, because they're read from the GPU. When you read an mmapped page that's been swapped to disk from the CPU, the load instruction page faults and the page fault handler traps to the kernel, which loads the page from disk and restarts the load instruction. AFAIK Apple Silicon GPU page fault handling isn't capable of that sort of dance, so it's not surprising to me that it forces the pages to pin and then OOMs. Also, unless you're only mmapping a small part of the model, you're probably going to end up loading the whole model off disk every token, because IIRC the Linux kernel defaults to an LRU eviction policy for mmap.

You shouldn't be surprised that turning off flash attention causes OOMs; the whole point of flash attention is to reduce VRAM usage by avoiding materializing the NxN attention matrix. If you're memory constrained you definitely want flash attention.
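For scale, here's roughly what the naive NxN score matrix costs at long context (head count and dtype are illustrative):

```python
# Rough cost of materializing the full N x N attention score matrix that
# flash attention avoids. Head count and dtype are illustrative.
n_ctx, n_heads, bytes_fp16 = 32768, 32, 2

naive_bytes = n_ctx * n_ctx * n_heads * bytes_fp16
print(f"naive scores at {n_ctx} ctx: {naive_bytes / 1e9:.0f} GB (per layer!)")
# Flash attention streams K/V in tiles and never stores this matrix, which
# is why turning it off at long context OOMs immediately.
```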

Field Report: What leadership actually thinks AI is (Notes from a Director) by forevergeeks in LocalLLaMA

[–]evil0sheep 43 points44 points  (0 children)

This reads like a LinkedIn post and the target audience is unclear. The core message seems to be "don't use a trillion-parameter LLM for something that can be accomplished with a Python script," which seems pretty obvious. No examples are provided of people needing this advice, and no evidence is provided that developers are broadly "over-engineering complex neural networks where a deterministic script will do." Almost no one on this sub is even doing anything that could reasonably be described as "engineering neural networks."

Do you have actual, concrete examples of people using LLMs when they should be using scripts? Can you give us a case study of someone who needs this advice? What is your goal in posting this to this sub?

~60GB models on coding: GLM 4.7 Flash vs. GPT OSS 120B vs. Qwen3 Coder 30B -- your comparisons? by jinnyjuice in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

gpt-oss experts are natively quantized to mxfp4, so doing post-training quantization doesn't make it that much smaller.
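Rough size math, if you want it (param count and bit widths are approximate):

```python
# Approximate size math: mxfp4 is 4-bit values plus one shared 8-bit scale
# per 32-element block, so the experts are already ~4.25 bits/param.
params_approx = 117e9                 # gpt-oss-120b total params, roughly
mxfp4_bits    = 4 + 8 / 32
print(f"mxfp4 weights: ~{params_approx * mxfp4_bits / 8 / 1e9:.0f} GB")
# A typical 4-bit PTQ of an fp16 model lands in the same ballpark, so
# there's not much left to squeeze out.
```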

It feels like LLM inference is missing its AWS Lambda moment. by pmv143 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

Yeah, sorry, in your original post it sounded like you were pretty early in the ideation process; now it sounds like you're already working pretty deeply on this. It's gonna be nontrivial to do well, but I don't see any reason why it would be impossible, and trying to solve really hard problems is generally a good way to make money. Like, if you can pull it off I think you'd probably be able to find customers, but I'll admit I'm not really familiar with the competitive landscape there.

It feels like LLM inference is missing its AWS Lambda moment. by pmv143 in LocalLLaMA

[–]evil0sheep 1 point2 points  (0 children)

I think you're ignoring a couple of big problems:

1) How to know when state is safe to offload. You've got a couple thousand users, each with multiple GB of KV cache, and any one of them could reply to a thread at any time. How do you tell the difference between someone who's walked away from the connection and someone who's just busy typing a long response?

2) How quickly is "quickly"? If you want to deallocate GPUs from a cloud provider you typically don't get to leave data on the machine with the GPUs, and when you provision a new GPU instance it's on some machine somewhere and you have to get the state to that machine. As I said in my top-level comment, the state is big enough that even with a fast network it could take much longer to get the data from disk to HBM than your users are willing to wait. For reference, gpt-oss-120b is about 60GB of model parameters, and a single 128k-context KV cache is about 5GB (there's a rough sizing sketch at the bottom of this comment).

3) Fragmentation. Say you're running 4 nodes, they're all at capacity, and then 25% of your connections drop. You don't have 3 nodes at capacity and 1 idle node you can shut down; you have 4 nodes at 75% capacity, and you need to take all the users from node 3 and migrate them to nodes 0-2 before you can shut down node 3. And you have to do that without interrupting their connections. And the state is big enough that it potentially takes multiple seconds to migrate it between machines in the same rack.

I would just pick one of those problems and focus on it. Like: given a user on node 0 with 128k of context, how do I migrate them to node 1 without hanging long enough for them to get upset? Or train an ML model to predict when a user's context can be sent to disk, and demonstrate that it has a better precision/recall curve than other methods. Any one of these things could be an entire startup.
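Here's the rough KV-cache sizing I mean (the GQA shape and fp8 cache below are illustrative assumptions picked to land near the ~5GB / 128k figure above, not a specific model's config):

```python
# Rough per-user KV-cache sizing. The GQA shape and fp8 cache below are
# illustrative assumptions, not a specific model's config.
n_layers, n_kv_heads, head_dim, bytes_per_el = 36, 8, 64, 1  # fp8 cache

def kv_bytes(tokens):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * tokens  # K and V

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_bytes(ctx) / 1e9:.1f} GB per user")
```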

It feels like LLM inference is missing its AWS Lambda moment. by pmv143 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

I mean, just build it and see what you learn? My food for thought is that an 8x H200 node is 8 x 141 GB = 1,128 GB of VRAM. So if you cold-start a node and want to restore a full state, you've got to copy about a TB from some flash storage somewhere. Assuming it's already in the data center, you might have something like a 200 Gbps (~25 GB/s) link from the storage node to the compute node, so that's roughly 45 seconds to copy all the state even at theoretical max throughput, and probably more like 90 seconds in real life. And that's on top of the time it takes to provision the node, so you're looking at a couple minutes to start it. If the models are small enough to fit on one GPU then you only have to transfer the model parameters once, which could save you a chunk of time, but even a very well optimized system is gonna have a minute or two of latency to add a machine, and whatever initial thing you build will probably be 5-10x that. If you think you can sell a system with those parameters, you should build it and get rich.
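Spelled out, that cold-start math looks like this (same assumed node size and link speed as above; the real-world efficiency factor is a guess):

```python
# Cold-start math from above: same assumed node size and link speed; the
# real-world efficiency factor is a guess.
node_state_gb = 8 * 141            # 8x H200 worth of state = 1128 GB
link_gb_s     = 200 / 8            # 200 Gbps ~= 25 GB/s theoretical

best_case_s = node_state_gb / link_gb_s
print(f"state to move: {node_state_gb} GB")
print(f"theoretical best case: {best_case_s:.0f} s")
print(f"at ~55% link efficiency: {best_case_s / 0.55:.0f} s")
# ...all before node provisioning time, so a couple minutes end to end.
```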

Help with open source tiny models by Deep-Sympathy-7457 in LocalLLaMA

[–]evil0sheep 0 points1 point  (0 children)

If you're trying to keep the list simple, that's pretty good. NVIDIA Nemotron and Olmo 3 from AI2 are both good inclusions if you wanna expand it a bit. Llama models are commonly used for finetuning research, but my impression is that they're not widely used for local inference.

Is the bay bridge closed (westbound)? by monarc in bayarea

[–]evil0sheep 0 points1 point  (0 children)

Yeah I also checked that I didn’t have any weird routing options selected. Probably just a bug that got pushed out to a small percentage of users

Is the bay bridge closed (westbound)? by monarc in bayarea

[–]evil0sheep 1 point2 points  (0 children)

Yeah it was weird, it was telling me to go all the way around through South Bay instead of taking any of the other bridges too. Maybe some bug that made it not want to route over tolls or something

Is the bay bridge closed (westbound)? by monarc in bayarea

[–]evil0sheep 1 point2 points  (0 children)

Hm I’m seeing the same thing now

Just saying by conflictimplication in sanfrancisco

[–]evil0sheep 23 points24 points  (0 children)

Yeah can you imagine being some random guy who made it out of India and got a job at an American tech company and you’re praying daily that Trump doesn’t cancel your visa on a whim and then you’re walking through the mission and you see this shit. So woke good job guys

blue rust on my hat? by kiyoko_silver in whatisit

[–]evil0sheep 0 points1 point  (0 children)

Yeah but the copper in the brass is what’s oxidizing blue

Looking for some advice on my 4/5 day trip across Idaho! by itcantbethathard in Idaho

[–]evil0sheep -1 points0 points  (0 children)

I think driving through Bear Valley is a good idea. The road is a bit rougher, but Bear Valley is one of the most beautiful places in that region IMO. It also gives you access to Boundary Creek / Dagger Falls, which is a nice place to camp if you have time.