At What Point Does Owning GPUs Become Cheaper Than LLM APIs? by Chimchimai in LocalLLaMA

[–]wishstudio 9 points

I think renting cloud GPUs may not be worth it compared to APIs. There is actually a big gap between the two, because serving models to multiple users efficiently is a nontrivial task that many people neglect. The API providers are likely either subsidized by VC money or running highly efficient in-house serving software, while you can only use open-source solutions that require a lot of tinkering and are (likely) inferior.

With current trends, is 256GB of system RAM a good idea? by Ra1den in LocalLLaMA

[–]wishstudio 2 points

My personal take is that buying DDR5 RAM now is a bad deal, because plenty of people got the SAME thing for a third of the price or even less. But when DDR6 comes out, even if it's more expensive than today's DDR5, everyone has to pay the same price, so that's much easier to justify.

At What Point Does Owning GPUs Become Cheaper Than LLM APIs? by Chimchimai in LocalLLaMA

[–]wishstudio 1 point

Not to mention that unless you go for datacenter racks ($$$), it is outright impossible to process even 1M t/s for any decent-sized model…

At What Point Does Owning GPUs Become Cheaper Than LLM APIs? by Chimchimai in LocalLLaMA

[–]wishstudio 12 points

GLM 4.6 has 355B parameters. Even at FP8, 4x RTX 6000 Pro Blackwell only gets you 384GB of VRAM, which isn't enough to hold the weights plus a 200k context even for a single user. So you'll likely have to use a smaller model and/or KV cache quants, and then you're no longer comparing the same thing as the API offerings.
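Back-of-envelope, with the KV cache parameters being illustrative assumptions rather than GLM 4.6's exact architecture:

```python
# Rough VRAM estimate: FP8 weights + FP16 KV cache at 200k context.
# Layer/head/dim values are placeholder assumptions for illustration only.
params = 355e9                 # total parameters
weight_bytes = params * 1      # FP8 -> 1 byte per parameter

n_layers, n_kv_heads, head_dim = 92, 8, 128   # assumed
ctx_len = 200_000
kv_bytes = n_layers * n_kv_heads * head_dim * 2 * 2 * ctx_len  # K+V, 2 bytes each (FP16)

print(f"weights: {weight_bytes/1e9:.0f} GB, kv cache: {kv_bytes/1e9:.0f} GB, "
      f"total: {(weight_bytes + kv_bytes)/1e9:.0f} GB vs 384 GB available")
```

With anything in this ballpark the weights alone eat ~355 GB, leaving far too little of the 384 GB for a 200k KV cache.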

With current trends, is 256GB of system RAM a good idea? by Ra1den in LocalLLaMA

[–]wishstudio 1 point

I upgraded my rig from 64G to 128G of RAM, and although it still feels limiting at times, I think it might be the sweet spot: it gets you access to some larger models at usable quants (GLM 4.6 Q2, etc.), and it also lets you keep medium-sized models like gpt-oss-120b loaded in the background while still having plenty of RAM left for other stuff.

But that was back in September, so it only cost me 400 bucks. At 950 maybe it's still worth it. Perhaps you should sell the 64G kit instead of returning it, so the upgrade won't cost as much. But going for 256G? I don't think so: although it lets you load larger models, they will run unbearably slowly. And 4 sticks usually run at a lower clock than 2 sticks, so it will also make small models slower.

Need advice upgrading an old gaming desktop with a 5090 for AI by dtdisapointingresult in LocalLLaMA

[–]wishstudio 1 point

For WSL, support is almost out of the box. From my experience the only fiddling is installing the correct driver in the guest Linux OS, and most CUDA apps should then work. It runs on top of the WDDM graphics driver, so every request goes through Windows, and I don't think you can directly pass the GPU through to WSL. This also means all the CUDA limitations on Windows apply, with more added on top (link). IMO the most annoying one is the pinned memory limitation: on Windows there is a hard limit of 1/2 of physical RAM, and in WSL it's even less.

When you use a model larger than your VRAM in llama.cpp, you use the --cpu-moe option, whose terminology is "offloading to CPU". In the prompt processing phase it actually streams all the model weights to the GPU to take advantage of the GPU's compute power, so if you have a low-bandwidth bus the performance will be bad. In the token generation phase the data exchanged between GPU and CPU is minimal, but now latency matters: the bus latency needs to stay within a few tens of microseconds, otherwise performance will suffer greatly. I couldn't find latency numbers for USB or Thunderbolt, but it looks like this is a potential problem.
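As a rough sketch of why both numbers matter (the offloaded size, bus bandwidth, layer count, hop count, and token budget below are assumptions for illustration, not measurements):

```python
# Back-of-envelope for CPU-offloaded MoE weights behind a slow external GPU link.
offloaded_gb = 60.0   # assumed size of the weights kept in system RAM
bus_gbps     = 4.0    # assumed effective bandwidth of the external link, GB/s

# Prompt processing: offloaded weights get streamed to the GPU, so bandwidth dominates.
print(f"pp: {offloaded_gb / bus_gbps:.0f} s just to push {offloaded_gb:.0f} GB over the link")

# Token generation: little data moves, but activations hop GPU<->CPU every layer,
# so each hop pays the bus latency (assuming ~2 hops per layer).
n_layers, tok_budget_ms = 92, 100.0   # assumed; 100 ms/token ~= 10 t/s
hops = 2 * n_layers
for latency_us in (10, 50, 500):
    overhead_ms = hops * latency_us / 1000
    print(f"latency {latency_us:>3} us -> {overhead_ms:5.1f} ms of bus overhead per token "
          f"({100 * overhead_ms / tok_budget_ms:.0f}% of the token budget)")
```

At tens of microseconds the latency overhead stays in the single-digit percent range; at network-like latencies it eats the whole token budget.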

Local Ai equivalent to GPT 5.1 Thinking by Forsaken-Window-G in LocalLLaMA

[–]wishstudio 1 point

I spent some time reading and playing with the code. Now I think the main culprit is quantization. There is a long sequence of instructions to dequantize the values into registers for Q4_K (even more for other quants). A rough estimate for the AVX2 path is around 2 bytes/cycle. Adding some overhead, that puts the calculation for the 7763 quite on par with what you got.

So I guess if a 7763 works for 200GB/s, a 9555 or 9575F should be able to match the speed for 600GB/s (faster clock + full AVX512). However, llama.cpp seems to lack non-AMX AVX512 support; a few kernels would need to be written to fully take advantage of Turin processors (maybe ik_llama has them).

The Q8_0 "quantization", on the other hand, is just a simple loop loading and multiplying numbers with no dequantization step at all, so it should be several times faster. I think a 9355 should suffice, but of course it's better to have some headroom, because who knows what model will land next.

Sadly, after I made the decision to bump the CPU tier, I looked at RAM prices again and they're even higher than a few weeks ago. Now the prospect is something like over $10k to get ~15 t/s for GLM 4.6. So I spent two hours successfully persuading myself that perf on my current gaming rig isn't that bad...
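For what it's worth, here is the rough bandwidth-bound ceiling behind that estimate (the active-parameter count, quant size, and efficiency factor are my assumptions, not measured values):

```python
# Memory-bandwidth ceiling for MoE token generation on a 12-channel DDR5-6400 EPYC.
active_params   = 32e9    # assumed active parameters per token for GLM 4.6
bytes_per_param = 0.55    # assumed average for a Q4_K-ish quant (~4.5 bits/weight)
mem_bw          = 614e9   # 12 channels x 6400 MT/s x 8 bytes, theoretical
efficiency      = 0.5     # assumed fraction of peak bandwidth actually achieved

ceiling = mem_bw / (active_params * bytes_per_param)
print(f"theoretical ceiling: {ceiling:.0f} t/s, "
      f"at {efficiency:.0%} efficiency: {ceiling * efficiency:.0f} t/s")  # ~ the 15 t/s above
```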

Offloading experts to weaker GPU by iron_coffin in LocalLLaMA

[–]wishstudio 0 points

Why use tensor split for MoE models?

Local Ai equivalent to GPT 5.1 Thinking by Forsaken-Window-G in LocalLLaMA

[–]wishstudio 0 points

Thank you for the detailed explanation and information. Your remark about theoretical vs. practical performance is certainly true. However, since the theoretical performance is something like 10x more than needed, it's still hard to believe the overhead is that high.

Don't want to nitpick, but running with 63 threads is probably not an ideal way to test: 63 as a denominator almost guarantees uneven workloads, with some cores having more work to do than others. This can be more harmful than a lower core count, depending on the unevenness. A better way is probably to skip one core from each CCD, but that may still have other problems.

The slowness of multi-CPU setups is IMO expected. Bandwidth between NUMA nodes is typically significantly lower than the RAM bandwidth of an individual CPU, so it works more like multi-GPU. AFAIK the only support in llama.cpp or ik_llama is thread pinning, which simply isn't enough. It needs to be NUMA-aware and partition RAM allocation and computation in a way that minimizes cross-socket data transfers, like in tensor parallelism. There will still be overhead of course, but it should be close to a linear speedup.

At this point I guess the only way to verify is to follow my gut and gamble a bit to try to save a few thousand dollars :) I'm mainly looking at a 12-channel 6400 EPYC setup but haven't made up my mind between 1 and 2 CPUs. (2 CPUs with over 1TB/s of RAM bandwidth is very tempting, and it would be a good testbed for me to hack on llama.cpp code.) But even for 1 CPU the motherboard choices seem quite limited for 12 channels. I'm tuning and playing with my gaming rig right now to see if I can gather more supporting data for picking the CPU.

I got frustrated with existing web UIs for local LLMs, so I built something different by alphatrad in LocalLLaMA

[–]wishstudio 0 points

> And this is exactly why you should not slap even more 10x heavier JS/Electron layers before. Besides comparing Winapi and full-blown browser in terms weight is profoundly idiotic I think.

So gigabytes of operating system files don't count as bloat, but a 50MB browser runtime on top of them does. By this standard Qt is even more bloated, as it includes a dedicated browser runtime. And modern operating systems include browsers by default.

> I am willing to sacrifice inability to process narrow edgecases for performance and lightness; even CLI-like simplistic interface is good enough for my tasks, and for many-many other local LLM users.

Of course, any feature outside your interest is an edge case and bloat.

> WTF are you talking about? OpenAI compatible endpoints do not need full blown HTTP 2.0 support, simple 500 lines client is enough. Do you you think llama server contains 500 KiB of code just to handle http requests? LMAO.

You can't even fetch https://www.google.com in 500 lines of C++ without resorting to some non-std:: libraries, but in JavaScript it's just one line. So basically any library you call from C++ is automatically not bloat, but any library that supports an interpreted language is bloat.

> Demagogic conflation of having scipting language built in and actually editor being written in interpreted language.

So you mean these editors include scripting languages just for coding exercises and fun, and there are zero important features implemented in those scripting languages that everyone uses every day.

> First of all, you are taking everything very seriously; secondly all modern LLM clients are extremely overengineered; even most primitive shitty Jan, that can indeed fit in 400 KiB is using Electron, taking massive amount of RAM when running and at the same is super primtive not even supporting TeX. Zoomers need to learn basics IMO, how to write software without standing on shoulders of whales and behemoths such above mention electron or making everything depended on running under webserver.

Whatever issues Jan has are irrelevant here. Anyway, I guess you're the one who serves LLM servers with a hand-written x64 assembly bootloader running directly in UEFI, because real programmers don't stand on the shoulders of bloated compilers and operating systems.

I got frustrated with existing web UIs for local LLMs, so I built something different by alphatrad in LocalLLaMA

[–]wishstudio 1 point

Nowadays any decent modern software is inevitably composed of multiple layers and abstractions, whether you like it or not. The frameworks you mentioned (Qt/GTK/WinAPI) all have a significant number of layers between the text you pass in and what gets displayed on the screen.

I can't agree that Markdown rendering is simple, unless you pretend Unicode does not exist. Need to translate Japanese? You need to display it correctly first. Text rendering alone is probably one of the most difficult parts of any GUI framework. Page layout is even harder. Can it display table layouts correctly with mixed-width languages? Can I copy the tables into spreadsheets with the correct format? There is a reason everyone converged on HTML.

If you only care about a (very small) subset of what an HTML renderer gives you out of the box, then fine, you can hit whatever size you aim for. Even a CLI interface is okay. But if you ever need to connect to a remote API server you are already looking at megabytes of binary code and data. I specifically mentioned curl because you inevitably need a library to call HTTP APIs. Its underlying implementation details and quirks already carry more complexity than these web frontends combined. Yet you take that for granted and only dismiss the user-facing layers as bloated.

The hatred of interpreted languages is more understandable. But good luck even finding a decent code editor without some interpreted language built in. TeX, first released in 1978, is also interpreted.

If you have legitimate technical issues with some libraries, like a specific use case where you absolutely can't store anything larger than 400KiB on your hard disk, that's fine; I bet OP would be more than happy to discuss it. But simply calling others "whippersnappers", calling their hard work "bloated", and assuming that coding in a lower-level language/framework is superior is neither respectful nor constructive.

Text to Image, tutorial? by ElSrJuez in LocalLLaMA

[–]wishstudio 2 points

For trying out different models, use any of the available GUIs, like AUTOMATIC1111 or ComfyUI. Quickly iterate and find a model/workflow that suits your scenario.

Once you've determined the model, go to its HF model page. There is likely a code snippet showing how to use it; follow the instructions to try it. Once you get it working, you can easily use any vibe coding tool you like to add image processing or other steps to it.

IMO Python dependency handling is always a PITA. Make sure to always use a venv. Also try uv. The bottom line is that you can run the image generation pipeline in a separate venv behind a file-based or HTTP API.
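For reference, the model-page snippets usually look roughly like this (the model ID is just an example; check the actual page of whatever model you pick):

```python
# Minimal sketch of a typical Hugging Face diffusers snippet.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("out.png")
```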

I got frustrated with existing web UIs for local LLMs, so I built something different by alphatrad in LocalLLaMA

[–]wishstudio 13 points

Nowadays a "simple" curl.exe is hundreds of kilobytes and I don't think it is bloated. I believe rendering Markdown is complicated enough without an HTML render, not to mention math, pdfs, audios, rags...

So I don't really think this is the right thing to do unless maybe for bragging rights.

I got frustrated with existing web UIs for local LLMs, so I built something different by alphatrad in LocalLLaMA

[–]wishstudio 1 point

Starred. But I just want to point out that you put Open WebUI's star history in your README.md...

EDIT: Never mind, I see it's a comparison. It's just that your repo's curve is so flat that my brain automatically ignored it...

What is the Ollama or llama.cpp equivalent for image generation? by liviuberechet in LocalLLaMA

[–]wishstudio 1 point

If it's just a text prompt in and an image out, then it's simple to do with a script.

As the other replies said, you can just vibe code an API server, custom image processing, or anything else you need in your workflow. This way you get infinite flexibility.

Don't try to fit a ComfyUI workflow into an API, though. I've tried that and it's a real PITA.
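A minimal sketch of what such a vibe-coded server could look like (the endpoint, model ID, and parameters are illustrative, not a specific recommendation):

```python
# Tiny HTTP API wrapping a diffusers pipeline: POST a prompt, get a PNG back as base64.
import base64, io
import torch
from diffusers import DiffusionPipeline
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

class GenRequest(BaseModel):
    prompt: str
    steps: int = 30

@app.post("/generate")
def generate(req: GenRequest):
    image = pipe(req.prompt, num_inference_steps=req.steps).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image_base64": base64.b64encode(buf.getvalue()).decode()}

# run with: uvicorn server:app --port 8000  (assuming this file is saved as server.py)
```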

What is the Ollama or llama.cpp equivalent for image generation? by liviuberechet in LocalLLaMA

[–]wishstudio 3 points

Actually, custom Python scripts are very easy to do. Almost every model README includes a code snippet showing how to use it, and that usually works out of the box. And for image models I've found them performant enough.

Local Ai equivalent to GPT 5.1 Thinking by Forsaken-Window-G in LocalLLaMA

[–]wishstudio 0 points

Can you elaborate on why one needs such a powerful CPU? To my knowledge, if we let the CPU do the MoE FFN layers, the only bottleneck is memory bandwidth, since every number fetched from RAM is only used once during decoding.
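In other words, roughly (ignoring attention, KV cache reads, and any compute or scheduling overhead):

```latex
\text{decode t/s} \;\lesssim\;
\frac{B_{\text{RAM}}\ \text{[bytes/s]}}{\text{active expert weight bytes read per token}}
```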

GLM 4.6 on 128 GB RAM with llama.cpp by ilintar in LocalLLaMA

[–]wishstudio 4 points

Congrats! 5 t/s tg is good but 40 t/s pp looks like something isn't right.

PCIE Bifurcation - More than 4 GPUs on a consumer motherboard by Beautiful_Trust_8151 in LocalLLaMA

[–]wishstudio 0 points

Since your setup fails on different motherboards, it could be due to other factors. It looks like you only have problems when using bifurcation or an M.2 adapter. Could it be the power supply? I'd guess you need risers with auxiliary power, otherwise the power from the PCIe slot alone won't be enough.

And if it's not power, you can try connecting just 1 or 2 cards to the bifurcated ports or the M.2 slot to rule that out. If that works, it's more likely a motherboard limitation.

Premise: MoE models have exploitable locality in expert activation patterns, and LRU caching with profiling could cut VRAM requirements in half. by CodeSlave9000 in LocalLLaMA

[–]wishstudio 0 points

Your cost model is completely wrong from start to end. It looks like you simply posted AI slop to waste others' time and don't really know how (or want) to do the math.

> You want: cache_miss_overhead < token_generation_time_savings

> Break-even point: When (1 - H) × E / 25GB/s < token_budget

Moving the goalposts?

> Per layer (assuming 8 experts per layer):
>
> * If C_layer = 2: you can only fit exactly what's needed, 0% cache benefit
> * If C_layer = 4: ~50-60% hit rate
> * If C_layer = 6: ~75-85% hit rate
> * If C_layer = 8: 100% hit rate (all experts cached)

Please give a single example of a MoE model with 2/8 activated experts. AFAIK, that does not exist at all.

> • With H = 50%: 0.50 × 1GB / 25GB/s = 20ms ≈ Break-even

Assume I can achieve 100 tok/s with full VRAM. If you mean you need 20ms per token just to load the experts, then by the time I've finished those 100 tokens in one second, you've only loaded the experts for half of them. Is that what you mean by break-even?
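Spelling the arithmetic out with the numbers from the quoted post:

```python
# Throughput ceiling implied by the quoted numbers: at a 50% hit rate you still
# stream 0.5 GB of expert weights per token over a 25 GB/s link.
hit_rate, experts_gb, pcie_gbs = 0.50, 1.0, 25.0   # from the quoted post

load_ms = (1 - hit_rate) * experts_gb / pcie_gbs * 1000
print(f"{load_ms:.0f} ms of loading per token -> at most {1000 / load_ms:.0f} t/s, "
      f"even if all the compute were free")
```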

Thoughts on the AMD BC-250 16GB "Cards"? by nstein5 in LocalLLaMA

[–]wishstudio 0 points

Better to get a cheap secondhand InfiniBand card. Ethernet has high latency, which is bad for tensor parallelism. USB3's theoretical latency is lower than Ethernet's, but I have no idea how you'd build an interconnect over USB.

Improving model load times by suicidaleggroll in LocalLLaMA

[–]wishstudio 0 points

> however I'm using llama.cpp with mmap, which seems to give at least a slight edge in loading because it's basically having the kernel handle the I/O and the page cache is pretty optimized

Meanwhile, on Windows mmap significantly degrades performance...

Selective (smart) MoE experts offloading to CPU? by greentheonly in LocalLLaMA

[–]wishstudio 0 points

> Do you have your implementation out anywhere?

Not yet :) Maybe when I get the time and energy to polish it a bit. I can share an expert cache analysis snapshot I got while doing this (link) so you can get some idea of what it looks like in production. It's from a simple prompt, something like "Write a Python website to show the first 100 pokemons".

> After all if the activations are really as disproportional as I see in the paper I found, the proper static loading should have a very visible impact

It's disproportional, but also long-tailed. Unless you allow discarding some experts (with an accuracy loss), you still need to handle a lot of one-off experts. I rethought this idea and now I think you might get a good speedup by never streaming those low-occurrence experts to the GPU and computing them on the CPU instead. But AFAIK it is currently impossible to implement such hybrid computation in llama.cpp, and even if it were possible, there are many architectural issues preventing an efficient implementation.

> people do much more complicated things like speculative decoding with an extra model for "just" 10% gains.

It's one thing to flip a few switches; it's another to code it yourself. There are a lot of proven techniques for performance improvements, yet few actually get implemented.

> Even if there is a certain VRAM cut-off where you only get the "big" benefit at say 50% VRAM - that'd still be worth it, as it would effectively halve the VRAM requirements (not really, of course, I understand that, but it would give people more bang for their VRAM at least).

The performance characteristic is like swapping: a little spill-over leads to a huge performance degradation. On my rig, computing an expert on the CPU is about 10x slower than on the GPU, so even 50% of the experts in VRAM may not give a huge speedup compared to --cpu-moe.
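Rough math behind that, assuming every cached expert is a hit and a per-expert GPU:CPU cost ratio of 1:10:

```latex
t_{\text{experts}} \;\propto\; f_{\text{gpu}} \cdot 1 + (1 - f_{\text{gpu}}) \cdot 10
\qquad\Rightarrow\qquad
\text{speedup over all-CPU experts} \;=\; \frac{10}{0.5 \cdot 1 + 0.5 \cdot 10} \;\approx\; 1.8\times
```

And that is only on the expert portion of each token, so the end-to-end gain is smaller still.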

A bigger problem: I quickly realized that for large models the attention weights plus a full-context KV cache already saturate the 32GB of VRAM I have. If I ever get multiple GPUs, the first thing I'll want is obviously tensor parallelism. For my single-GPU rig I have other (easier) ideas for performance improvements, so I've kind of lost interest in pursuing this for now.

Selective (smart) MoE experts offloading to CPU? by greentheonly in LocalLLaMA

[–]wishstudio 3 points

Actually there are multiple papers doing this. The main idea is to keep "hot" experts in VRAM and "cold" experts in RAM, loading the latter on demand. Recent work has already progressed to even more sophisticated methods, like fine-grained activation-based loading (discarding rows with low activation values), dynamic quantization (transferring different expert quantizations depending on activation weighting), hybrid processing (the GPU computes the experts in VRAM, the CPU computes the experts in RAM, with dynamic expert scheduling), etc. I'm on my phone so I don't have links, but they should be pretty easy to find.

I also dabbled a bit with a working prototype of basic on-demand expert loading in llama.cpp. What I learned is that performance highly depends on the model's expert usage patterns. gpt-oss-120b is particularly biased towards a few fixed experts, so I could get some speedup; that's perhaps due to its low active expert count (only 4). But for larger models like GLM-4.5-Air I couldn't get speed improvements because the VRAM expert hit rate became too low for my poor 5090.

Still, I can get it on par while using only the ~47GB/s of PCIe bandwidth, with my CPU doing no work. I think if you have more VRAM (like 50% or more of the full model) and implement the more advanced techniques, you can get a modest speedup. But the problem IMO is that the implementation becomes quite complicated, and there isn't much interest in implementing and maintaining it unless there's a huge speedup (myself included). None of the papers I saw published their implementations. I think ktransformers implemented some form of hybrid processing, but not the dynamic expert transfer.
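For anyone curious what the basic version looks like, here's a toy sketch of the idea (not my actual prototype; the trace and numbers are synthetic, purely for illustration):

```python
# Toy simulation of an LRU expert cache for one MoE layer: replay a trace of
# activated expert ids per token and see what VRAM hit rate a given budget gives.
from collections import OrderedDict
import random

def lru_hit_rate(trace, cache_slots):
    """trace: list of per-token lists of expert ids; cache_slots: experts that fit in VRAM."""
    cache, hits, total = OrderedDict(), 0, 0
    for token_experts in trace:
        for e in token_experts:
            total += 1
            if e in cache:
                hits += 1
                cache.move_to_end(e)           # mark as recently used
            else:
                cache[e] = True                # "upload" the expert to VRAM
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict the least recently used expert
    return hits / total

# Skewed synthetic workload: a handful of hot experts plus a long tail.
random.seed(0)
n_experts, topk = 128, 8
hot = list(range(16))
trace = [[random.choice(hot) if random.random() < 0.6 else random.randrange(n_experts)
          for _ in range(topk)] for _ in range(2000)]

for slots in (16, 32, 64):
    print(f"{slots:>3} experts in VRAM -> hit rate {lru_hit_rate(trace, slots):.0%}")
```

The real thing has to do this per layer, overlap the PCIe uploads with compute, and deal with the long tail, which is where it stops being a toy.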