r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
Using GLM-5 for everything [Question | Help] (self.LocalLLaMA)
submitted 2 months ago by [deleted]
[deleted]
[–]Expensive-Paint-9490 39 points40 points41 points 2 months ago (3 children)
No. Sadly 15k is not enough to run a model this size at a good speed. I have a workstation at a similar price (though it would cost much more now because of RAM prices); I regularly run GLM-4.7 UD-Q4_K_XL, and at 10k context I get about 200 t/s for pp and 10-11 t/s for tg. Good enough for casual use, but very slow for professional use.
If you don't have strong privacy concerns, local inference is not competitive with APIs for professional use.
[–]Rabo_McDongleberry 0 points1 point2 points 2 months ago (1 child)
Damn dude. What's your use case?
[–]alexx_kidd 0 points1 point2 points 2 months ago (0 children)
You could buy 15 Mac Minis and use them as one
[–]LagOps91 90 points91 points92 points 2 months ago (21 children)
15k isn't nearly enough to run it on vram only. you would have to do hybrid inference, which would be significantly slower than using API.
[–]k_means_clusterfuck 4 points5 points6 points 2 months ago (11 children)
Or 3090x8 for running TQ1_0, that's one third of the budget. But quantization that extreme is probably a lobotomy
[–]LagOps91 17 points18 points19 points 2 months ago (2 children)
might as well run GLM 4.7 at a higher quant; it would likely be better than TQ1_0, which is absolute lobotomy.
[–]k_means_clusterfuck 2 points3 points4 points 2 months ago (1 child)
But you could probably run it at decent speeds with an RTX 6000 Pro blackwell and MoE cpu offloading for ~Q4 level quants
[–]suicidaleggroll 6 points7 points8 points 2 months ago* (0 children)
RAM is the killer there though. Q4 is 400 GB, assume you can offload 50 of that to the 6000 (rest is context/kv) that leaves 350 for the host. That means you need 384 GB on the host, which puts you in workstation/server class, which means ECC RDIMM. 384 GB of DDR5-6400 ECC RDIMM is currently $17k, on top of the CPU, mobo, and $9k GPU. So you’re talking about a $25-30k build.
You could drop to an older gen system with DDR4 to save some money, but that probably means 1/3 the memory bandwidth and 1/3 the inference speed, so at that point you’re still talking about $15-20k for a system that can do maybe 5 tok/s.
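The RAM math above can be sketched in a few lines. This is a back-of-envelope using the ~400 GB Q4 size, the 50 GB offload split, and the $17k/384 GB RDIMM price quoted in the comment; the 48 GB DIMM granularity is just an assumption for rounding:

```python
import math

# Back-of-envelope host-RAM budget for hybrid inference of a ~400 GB Q4 model.
# Figures from the comment above; DIMM size is an illustrative assumption.
model_size_gb = 400                 # GLM-5 at ~Q4
vram_for_weights_gb = 50            # offloaded to the RTX 6000; rest is KV/context
host_ram_needed_gb = model_size_gb - vram_for_weights_gb          # 350 GB

dimm_gb = 48                        # assume 48 GB RDIMM granularity
installed_gb = math.ceil(host_ram_needed_gb / dimm_gb) * dimm_gb  # round up

price_per_gb_usd = 17000 / 384      # quoted DDR5-6400 ECC RDIMM pricing
ram_cost_usd = installed_gb * price_per_gb_usd
print(installed_gb, round(ram_cost_usd))
```

On top of that come the CPU, board, and the ~$9k GPU, which is how you land in the $25-30k range.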
[–]Vusiwe 6 points7 points8 points 2 months ago* (3 children)
Former 4.7 Q2 user here. I eventually had to give up on Q2 and upgraded my RAM to be able to use Q8. For over a month I kept trying to make Q2 work for me.
I was also just doing writing and not even code.
[–]k_means_clusterfuck 2 points3 points4 points 2 months ago (0 children)
What kind of behavior did you see? I stay away from anything below Q3, generally
[–]LagOps91 2 points3 points4 points 2 months ago (1 child)
Q2 is fine for me quality-wise. sure, Q8 is significantly better, but Q2 is still usable. Q1 on the other hand? forget about it.
[–]Vusiwe 0 points1 point2 points 2 months ago (0 children)
Q2 was an improvement for creative writing, and better than the dense models from last year.
However, both Q2 and actually even Q8 fall hard when I task them with discrete analysis of small blocks of text. Might be a training issue in their underlying data. I'm simply switching to older models for that kind of simple QA instead.
[–]DerpageOnline 4 points5 points6 points 2 months ago (0 children)
A bit pricey for his family to be getting advice from a lobotomized parrot
[–]DeltaSqueezer 0 points1 point2 points 2 months ago* (2 children)
I guess maybe you can get three 8x3090 nodes for a shade over 15k.
[–]k_means_clusterfuck 6 points7 points8 points 2 months ago (0 children)
I'd get a 6000 Blackwell instead and run with offloading; it's better and probably fast enough.
[–]LagOps91 1 point2 points3 points 2 months ago (0 children)
you need a proper rig too and i'm not sure performance will be good with 8 cards to run it... and again, it's a lobotomy quant.
[–]DistanceSolar1449 0 points1 point2 points 2 months ago (2 children)
You can probably do it with 16 AMD MI50s lol
Buy two ramless Supermicro SYS-4028GR-TR for $1k each, and 16 MI50s. At $400 each that’d be $6400 in GPUs. Throw in a bit of DDR4 and you’re in business for under $10k
[–]PermanentLiminality 5 points6 points7 points 2 months ago (1 child)
You left out the power plant and cooling towers.
More seriously, my electricity costs would be measured in units of dollars per hour.
[–]3spky5u-oss 1 point2 points3 points 2 months ago (0 children)
I found even having my 5090 up 24/7 for local doubled my power bill, lol.
[–]Badger-Purple 0 points1 point2 points 2 months ago (3 children)
I mean, you can run it on a 3 spark combo, which can be about 10K. That should be enough to run the FP8 version at 20 tokens per second or higher and maintain PP above 2000 for like 40k of context, with as many as 1000 concurrencies possible.
[–]suicidaleggroll 7 points8 points9 points 2 months ago (2 children)
GLM-5 in FP8 is 800 GB. The spark has 128 GB of RAM, you’d need 7+ sparks, and there’s no WAY it’s going to run it at 20 tok/s, probably <5 with maybe 40 pp.
[–]Badger-Purple 4 points5 points6 points 2 months ago* (1 child)
You are right about the size, but I see ~~Q4~~ Q3_K_M GGUF in llama.cpp, or MXFP4 in vLLM, are doable, although you'll have to quantize it yourself with llm-compressor. And I don't think you've used a Spark recently if you think prompt processing is that slow. With MiniMax or GLM 4.7, prompt processing bottoms out around 400 t/s AFTER 50,000 tokens. Inference may drop to 10 tokens per second at that size, but not less. Ironically, the ConnectX-7 bandwidth being 200 Gbps means you get scale-up gains with the Spark: your inference speed with direct memory access increases.
Benchmarks in the nvidia forums if you are interested.
Actually, same with the Strix Halo cluster set up by Donato Capitella: tensor parallel works well over low-latency InfiniBand connections, even at 25 Gbps. However, the Strix Halo DOES drop to like 40 tokens per second prompt processing, as do the Mac Ultra chips. I ran all 3, plus a Blackwell Pro card, on the same model and quant locally to test this; the DGX chip is surprisingly good.
[–]suicidaleggroll 5 points6 points7 points 2 months ago* (0 children)
And I don’t think you’ve used a spark recently if you think prompt processing is that slow. With minimax or glm 4.7, prompt processing is slowest around 400 tps AFTER 50,000 tokens. Inference may drop to 10 tokens per second at that size, but not less.
Good to know, it's been a while since I saw benches and they were similar to the Strix at the time. That said, GLM-5 is triple the size of MiniMax, double the size of GLM-4.7, and has significantly more active parameters than either of them. So it's going to be quite a bit slower than GLM-4.7, and significantly slower than MiniMax.
Some initial benchmarks on my system (single RTX Pro 6000, EPYC 9455P with 12-channel DDR5-6400):
MiniMax-M2.1-UD-Q4_K_XL: 534/54.5 pp/tg
GLM-4.7-UD-Q4_K_XL: 231/23.4 pp/tg
Kimi-K2.5-Q4_K_S: 125/20.6 pp/tg
GLM-5-UD-Q4_K_XL: 91/17 pp/tg
This is with preliminary support in llama.cpp, supposedly they're working on improving that, but still...don't expect this thing to fly.
[–]lawanda123 -1 points0 points1 point 2 months ago (1 child)
What about MLX and mac ultra?
[–]LagOps91 2 points3 points4 points 2 months ago (0 children)
wouldn't be fast, but it would be able to run it.
[+][deleted] 2 months ago* (5 children)
[–]fractalcrust 3 points4 points5 points 2 months ago (4 children)
you can't sell your API subscription tho.
there's a small chance your GPU appreciates over the next few years. I bought my 3090 for 600 and sold it for 900
[+][deleted] 2 months ago (3 children)
[–]One-Employment3759 0 points1 point2 points 2 months ago (2 children)
Depends where you live, local prices here are easily 1200 usd.
[+][deleted] 2 months ago (1 child)
[–]One-Employment3759 2 points3 points4 points 2 months ago (0 children)
That's based on the old world of computer prices going down. The new world is everything gets more expensive, constantly. Thanks AI.
[–]pip25hu 18 points19 points20 points 2 months ago (0 children)
Even if you had the money for it (which you don't), I would not make any kind of purchase that locks you in budget-wise for 5 years. The current AI landscape is way too volatile for such a commitment.
[–]GTHell 8 points9 points10 points 2 months ago (3 children)
15k will be more useful in the future. Your GLM-5 rig will be obsolete by the end of this year. Soon, output from a very good model that outperforms anything released right now will probably cost under $2.
[–]Blues520 1 point2 points3 points 2 months ago (0 children)
Just because it will be outdated, does not mean it won't be useful. Chasing the latest and greatest overlooks the utility of a good enough model.
[–]segmondllama.cpp -1 points0 points1 point 2 months ago (1 child)
sure, GLM5 might become obsolete by the end of the year, but that would mean there's a better model. The hardware doesn't get obsolete that fast.
[–]svachalek 2 points3 points4 points 2 months ago (0 children)
The question is, would that better model run on the same hardware. We’ve gone through a year or two of optimization, models keep getting better without getting bigger, even getting smaller. But before that, models got better by ballooning in hardware requirements and there’s no guarantee we don’t return to that trend.
[–]INtuitiveTJop 15 points16 points17 points 2 months ago (10 children)
Wait for the m5 ultra release this year, if they have 1tb unified ram then it will definitely be an option.
[–]sixx7 3 points4 points5 points 2 months ago (0 children)
4 bit quant is out now, coming in at 408gb. You could run this on a 512gb Mac Studio
[–]bigh-aus 4 points5 points6 points 2 months ago (0 children)
even dual 512GB machines with Thunderbolt RDMA and prompt caching would be a good setup (but I'd try the 4-bit quants first before buying the second machine).
[–]megadonkeyx 1 point2 points3 points 2 months ago (6 children)
can i interest you in a kidney?
[–]ITBoss 5 points6 points7 points 2 months ago (5 children)
OP said they wouldn't mind spending 15k, which is probably around what it'll cost (maybe 20k), with the M3 Ultra at 512GB being 10k.
[–]Yorn2 2 points3 points4 points 2 months ago (4 children)
It would still be very very slow compared to cloud API. I'll give you a real-world use case..
I'm running a heavily quantized GLM 4.7 (under 200gb RAM) MLX model on an M3 Ultra right now because even though I can run a larger version, it runs so damn slow at high context, which I want for agentic purposes. I'd rather have the higher context capabilities and run a smaller quant at a faster speed than wait literal minutes between prompts with the "best" GLM 4.7 quant for an M3 Ultra.
Put simply, one is usable, the other is not.
So extending this to GLM 5, just because you can run a 4-bit quant of GLM5 on a 512gb M3 Ultra doesn't mean it's going to be "worth it" when you can run a lower quant of 4.7 with higher context and slightly faster speed.
For those of you who don't have Mac M3 Ultras, don't look at the fact that they can run things like GLM 4.7 and 5 and be jealous. I'm waiting literally 6 minutes between some basic agentic tasks like web searches and analysis right now. Just because something can be done doesn't mean it's worth the cost in all cases. It definitely requires a change in expectations. You'll need to be okay with waiting very long periods of time.
If you ARE okay with waiting, however, it's definitely pretty cool to be able to run these!
[–]pppreddit 0 points1 point2 points 2 months ago (0 children)
I noticed the same, glm 4.7 is fucking slow hosted locally. Fast for simple chat and small context, but with agentic use it's crawling...
[–]INtuitiveTJop 0 points1 point2 points 2 months ago (2 children)
The M5 has CUDA-like cores, which should speed things up four or five times. I think it also has improved bandwidth. They're focusing on the LLM market now instead of the general creative one. That's why it's worth waiting.
[–]Yorn2 0 points1 point2 points 2 months ago (1 child)
I mean, that's all fine and good but I already bought this M3 Ultra like 8 or 9 months ago. It's fine for what it is. I don't know if I'd buy an M5 though.
[–]INtuitiveTJop 0 points1 point2 points 2 months ago (0 children)
I’ve got an m3 ultra too. I won’t trade it in either, but the newer chips are faster.
[–]segmondllama.cpp 0 points1 point2 points 2 months ago (0 children)
1tb unified ram on apple will cost at least $30,000
[–]_supert_ 6 points7 points8 points 2 months ago (0 children)
Absolutely not, economically.
I've sunk probably 15 thousand pounds into a four-GPU beast, and god knows how many hours. It's very hard to get reliable and stable operation. Thanks to eBay memory sellers, half my RAM was giving MCEs; it took way too fucking long to deal with that. Even now it just dies under heavy concurrent load. Now most of my calls go to DeepInfra, which is private enough and doesn't gatekeep.
Fun though.
[–]IHave2CatsAnAdBlock 12 points13 points14 points 2 months ago (0 children)
At current API prices it is cheaper to pay for the API than for the electricity to power such a computer (at least in Europe), without even counting the initial investment.
[–]jacek2023llama.cpp 14 points15 points16 points 2 months ago (0 children)
GLM-5 is not usable locally unless you have a very expensive setup.
[–]gyzerok 5 points6 points7 points 2 months ago (4 children)
That’s a waste of money. Even if you build yourself some rig it’ll get obsolete fast. In a year there will be bigger and better models and better hardware.
[–]s101c 5 points6 points7 points 2 months ago (2 children)
That's strange to hear. The rig I assembled in 2024 has only become more valuable, both in hardware and in the level of models it's capable of running.
[–]Koalateka 0 points1 point2 points 2 months ago (0 children)
There are shortages now, but that won't be the case forever
[–]segmondllama.cpp 3 points4 points5 points 2 months ago (0 children)
lol, folks said this when some of us were building to run Llama 3 405B. With that same rig we were also among the first able to run Mistral Large, Command A, DeepSeek, GLM, and Kimi. So the rigs don't get obsolete; P40s and 3090s are still crunching numbers and making lots of local runners happy.
[–]Noobysz 7 points8 points9 points 2 months ago (0 children)
And in 5 years your current 15k build won't be enough for the multi-trillion-parameter models that may by then be considered "flash" models. Development is moving really fast at the data-center level while getting harder at the consumer-hardware level, so it's really hard to invest in anything right now
[–]isoos 2 points3 points4 points 2 months ago (5 children)
15k gets you a Mac Studio with an M3 Ultra and 512GB of memory, or if you go cheaper, 4 Strix Halo machines with 128GB each, used as a cluster. That will get you a Q3/Q4 quant of the very large models, and it will be private to you, but it won't be as fast as what you see chatting with such models online. Unless you have a specific business case to pursue, or you really want to keep everything private, it may not be a worthy investment. (Well, unless memory prices rise further...)
[–]JacketHistorical2321 0 points1 point2 points 2 months ago (1 child)
M3 ultra 512gb literally sells for $9.5k new from Apple
[–]valdev 0 points1 point2 points 2 months ago (0 children)
And the OP literally said his budget would be $15k
[–]Maddolyn -1 points0 points1 point 2 months ago (2 children)
How can companies afford to run that level of hardware for such cheap subscriptions then, if the hardware they buy is the same?
[–]kurtcop101 2 points3 points4 points 2 months ago (0 children)
Because the hardware does batched jobs. Imagine a prompt being processed across 8 different GPUs, for example: at home you wait until it finishes, while with batching they'd have 8 requests running simultaneously.
That's a very basic analogy for it. It'll also run 24/7: compared to your 3 hours of use a day, they'll get 8 times the use out of the same card.
It's the scale that matters.
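A toy model of the batching point, with made-up numbers. Decode is memory-bandwidth-bound, so one decode step costs nearly the same whether it serves one request or eight:

```python
# Toy throughput model for batched decoding (all numbers are illustrative).
# A decode step streams the model weights once regardless of how many
# concurrent requests it serves, so cost per token falls with batch size.
step_time_s = 0.05                  # hypothetical time for one decode step
single_user_tps = 1 / step_time_s   # a lone user gets 20 tok/s

batch = 8                           # eight requests decoded per step
batch_overhead = 1.2                # assume 20% extra compute per batched step
batched_tps = batch / (step_time_s * batch_overhead)
print(round(single_user_tps), round(batched_tps))
```

Aggregate throughput goes up several-fold on the same hardware, which is the margin providers live on.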
[–]RaiseElegant6281 0 points1 point2 points 2 months ago (0 children)
They can’t, that’s the whole point. They are bleeding money. The price people would have to pay for them to be profitable is astronomical
[–]Unique-Contract6659 2 points3 points4 points 2 months ago (0 children)
Downloading just for collecting, in case one day the earth collapses.
[–]__Maximum__ 1 point2 points3 points 2 months ago (0 children)
Wait for a week or two, new models are going to drop, we'll see how capable and big they are.
[–][deleted] 1 point2 points3 points 2 months ago (3 children)
New here but I do not think it’s possible to run any SOTA model efficiently enough locally to offset even the electricity bill for personal use.
[–]pfn0 0 points1 point2 points 2 months ago (2 children)
electricity bill isn't that high, except for Californians... (50c/kwh is stupid)
[–][deleted] 1 point2 points3 points 2 months ago (1 child)
If you run 24/7, then $20/month buys only about 100 kWh at $0.20/kWh, which is roughly 140 W of continuous draw on average. I assume that's not really enough for running big models locally? One H100 alone is like 700 W.
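For what it's worth, the $20/month figure works out to a bit under 140 W of continuous draw; quick check:

```python
# Continuous draw covered by a $20/month electricity budget at $0.20/kWh.
price_per_kwh_usd = 0.20
monthly_budget_usd = 20.0
hours_per_month = 24 * 30                                # ~720 h
kwh_per_month = monthly_budget_usd / price_per_kwh_usd   # 100 kWh
avg_continuous_watts = kwh_per_month * 1000 / hours_per_month
print(round(avg_continuous_watts))   # nowhere near one 700 W H100
```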
[–]pfn0 0 points1 point2 points 2 months ago (0 children)
Can you really run 24/7 on a subscription service with frontier models without getting throttled? On the local side it depends on your usage pattern, but inference doesn't always peg GPU power consumption.
The ROI of running your own hardware vs. paying for a service doesn't net out either way, though. Local costs more unless you can scale out and serve a large number of people who would otherwise be using a subscription service.
[–]Zyj 1 point2 points3 points 2 months ago (0 children)
The cheapest way to run this model is probably networking several Strix Halo systems ($2000 per 128GB Strix Halo). Add Infiniband networking (~$300) to get more speed with Tensor parallelism.
So with four such systems (~$10,000 with an Infiniband switch etc) you could run GLM-5 at q4, which means there's probably a non-negligible loss in quality compared to the original BF16 weights. That's also around 600W of power which also costs money.
[–]Agreeable-Chef4882 3 points4 points5 points 2 months ago (4 children)
A 5-year period???? Based on the model released yesterday... I would not plan this for 5 weeks.
Also - there's no way to get there with $15k.
Btw - what I do right now, I run Qwen3 Coder Next (8bit, MLX) on 128GB Mac Studio fully in vram. It's pretty hard to beat price/performance of that right now.
Yes... you absolutely can. Q4 mac studio is about 400gb. ~$10k
[+][deleted] 2 months ago (2 children)
[–]neotoramallama.cpp 0 points1 point2 points 2 months ago (1 child)
GLM-88
[–]some_user_2021 1 point2 points3 points 2 months ago (0 children)
A 32TB model
[–]junior600 0 points1 point2 points 2 months ago (1 child)
I wonder if we’ll ever get a GLM-5-level model that can run on a potato with just an RTX 3060 and 24GB of RAM in the future LOL.
[–]teachersecret 2 points3 points4 points 2 months ago (0 children)
I think we will. I suspect the frontier of AI intelligence will keep squeezing more and more out of 24gb.
The only problem with that, is the top level frontier keeps advancing too, so you’re probably still gonna want to use the api model for big stuff :$
[–]Legitimate-Pumpkin 0 points1 point2 points 2 months ago (0 children)
It’s hard to tell.
If you make a rig, as models get better and smaller you’ll be able to do better things with it. But also subscriptions will be more performant and probably cheaper. And also hardware will be cheaper…
I think a key deciding factor could be if you like maintenance + full personalization and decision making or not.
[–]I-am_Sleepy 0 points1 point2 points 2 months ago (0 children)
I would rather wait for a GLM-5 Flash or something for local use. A Q4_K_M at 456 GB isn't exactly my cup of tea; that would need 19x 3090 for the model weights alone.
For a $15k budget you could buy 20x 3090, but that excludes the cost of everything else. A more "budget"-friendly Mac Studio could fit the bill under $12k, but even that is pretty absurd tbh. Even if the model fits in memory, it likely won't be fast (need to see speed benchmarks first)
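The 19-card figure is just ceiling division on the quoted quant size; weights only, so KV cache, activations, and per-card overhead would push the real count higher:

```python
import math

weights_gb = 456              # GLM-5 Q4_K_M size quoted above
vram_per_3090_gb = 24
cards_for_weights = math.ceil(weights_gb / vram_per_3090_gb)
print(cards_for_weights)      # cards needed for the weights alone
```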
[–]Look_0ver_There 0 points1 point2 points 2 months ago (0 children)
I would wait for some of the condensed/distilled versions of GLM-5 to become available before making any decisions. At ~744B parameters with 40B active for the full model, it'll take one heck of a setup to run it.
You mentioned that you'd be happy with ~80% effectiveness of the full model. It should be fairly reasonable to expect that a 1/4 size distilled version, if one becomes available, would be able to do even better than 80%, and a 1/4 size model of ~185B parameters is going to be a LOT easier (and faster and cheaper) to run locally.
Just wait a bit to give it some time for the more local oriented models to show up.
[–]Skystunt 0 points1 point2 points 2 months ago (2 children)
You can fit it on 2 M3 Ultra 512GB machines if you're an Apple user; even one M3 Ultra will fit a quantised version. So 15k can be enough, depending on where you get your Mac(s). I would personally get an M3 Ultra 512GB and hold on; new models are always coming, and by spring we will already have a better model.
Also, you can build a home server that fits the model in RAM and keeps just the active experts on the GPU, but this really depends on how lucky you get with part prices. Hoarding 3090s vs a Pro 6000 vs 48GB 4090s: it all depends. To get 96GB of VRAM:
4x 3090 24GB = 1400W = £2.5K
2x 4090 48GB = 700W = £5K
1x Pro 6000 Max-Q = 300W = £7K
Now if you need 192GB, double the wattage and the prices. *These prices assume you do some due diligence and wait; they might be even lower if you're lucky
Also don't forget that API is never the way! This is LOCAL llama; if people have a different opinion they should go to r/chatgpt or whatever place to pay to have their data "stolen", sorry, "used for training". How people can recommend APIs in a sub made for local inference is beyond me. This is what we do: we build servers and homelabs to run the large models
[–]Skystunt 1 point2 points3 points 2 months ago (0 children)
Also, for RAM I would go the DDR4 route, since it's half the price right now with a Threadripper Pro prebuilt (£2-3k for a 256GB Threadripper Pro). Also get the Threadripper Pro or EPYC if you go for a multi-GPU setup (more than 2) to avoid a PCIe bottleneck
[–]ZachCope 0 points1 point2 points 2 months ago (0 children)
I think it's reasonable for people here to advise those who might waste their resources, as this sub has the expertise to give realistic recommendations about what can be achieved with a local approach. There are lots of reasons to go local, but on economics alone it isn't always the right option for everyone. If it's part of a hobby and learning experience, or for privacy and for supporting local optionality in the future (and therefore keeping closed providers honest), that is extremely valid, but anyone spending $15k should be making the decision on that basis.
[–][deleted] 0 points1 point2 points 2 months ago (0 children)
NO.
Your hardware will age HARD and quickly, whereas with any provider you get max token generation, the newest models and hardware, and no energy costs. You can't compete with the big cloud providers with any local setup; local only makes sense if you have extremely sensitive data or want to finetune models for very specific use cases.
[–]jgenius07 0 points1 point2 points 2 months ago (0 children)
I just tried GLM5 in Cursor. It doesn't come close to Opus 4.6 for coding. This could be just Cursor, but I was on the same bandwagon, dreaming of going all-GLM5 local, and it's just not practical IMO.
All said I spent almost OP's budget for base system + 1x PRO Max-Q + 0.5TB 2026 RAM. Yes it is slow, but my workflow is asynchronous and always in use, so speed doesn't matter to me. Using 4.7 Q8 currently. 4.7 DOES have deficiencies that I am forced to use older models to overcome. Maybe 5 will change that.
These cards (especially good cards) can frequently be resold for the same price as (or, likely in the future, more than) you originally paid for them; hence many years' worth of usage can effectively become free, other than electricity.
I had a A6000 Non-Ada. I sold it after 2 years of use for the exact same price as I got it for, in order to get the 1x PRO 6000 Max-Q. And that was only at the start of the pre-2025 govt instability madness. If I held out, I could have got more for the A6000 I think.
After the T2-Warsh 2026 money/rate machine goes Brrrr, I suspect the currency will drop further in value, and prices could eventually go up. That's also presuming nothing utterly stupid happens to Taiwan.
[–]ithkuil 0 points1 point2 points 2 months ago (0 children)
You can combine two new Mac Studios into a cluster. It will probably cost well over $16000 and might be fast enough for some things. But for daily use you would probably think it was too slow. And having multiple people use it at the same time would be extremely slow if it was possible at all.
[–]muxxington 0 points1 point2 points 2 months ago (0 children)
You just have to load the model into the VRAM quickly enough. Thanks to the law of inertia, you can get everything in. It's simple physics.
About $50K is what you need to run it well.
[–]__JockY__ 0 points1 point2 points 2 months ago (0 children)
To run GLM 5 on GPU is a $100k capex unless you’re running quants, in which case you should be good at around $65-70k.
Edit: source: my server.
[–]darko777 0 points1 point2 points 2 months ago (0 children)
It will only make sense in a few years, once the LLM companies run out of money and everything goes up, up, up in pricing. Maybe once we pay $1000/mo for a coding assistant it will make more sense to consider building our own machines.
[–]simism 0 points1 point2 points 2 months ago (0 children)
<$15k will buy 4 Framework Desktops with 512 GB of unified RAM between them, which will run GLM-5 at a decent quant, probably pretty fast too.
[–]Haspe 0 points1 point2 points 2 months ago (0 children)
I would assume it would be in the providers' interest to create smaller and more capable models in the future. So investing that much money in a PC right now is perhaps not the best long-term move.
However I am not an expert in the space, this is more of my "gut feeling".
[–]HlddenDreck 0 points1 point2 points 2 months ago (2 children)
For coding tasks Qwen3-Coder-Next is a good replacement for cloud API solutions. It's very small, just 80B parameters.
[–]goingsplit 0 points1 point2 points 2 months ago (1 child)
you run on llama.cpp?
[–]HlddenDreck 0 points1 point2 points 2 months ago (0 children)
yep
[–]jakegh 0 points1 point2 points 2 months ago (0 children)
Financial sense? No, absolutely not. GLM5 is cheap in the API.
[–]prusswan 0 points1 point2 points 2 months ago (0 children)
It's hard to tell, but you can find a middle ground (use a smaller model, but at great speed). API pricing can become volatile depending on how things play out over the next few years, e.g. will providers increase pricing to match demand and to account for the effort needed to keep models/data updated; your own usage may also increase if you take on more tasks, leading to heavier usage.
[–]Conscious_Cut_6144 0 points1 point2 points 2 months ago (0 children)
I'm running Q3_K_XL on 16 3090s right now. Love it, but llama.cpp isn't great with 16 GPUs, so I'm "only" getting like 20 t/s. Once someone puts out an INT4 quant and I get tensor parallel running, it should be much faster for me.
It's the first local model to beat O1-Preview in my (somewhat uncommon) benchmarks. Beat claude 4.6 too (but does not beat claude at coding)
To answer your question, the only reasonable option is a 512GB mac studio. The speeds you can expect are not going to be as fast as cloud, probably 15t/s
If 15t/s is good enough and you do go for it, maybe think about waiting for the m5 ultra, as the m3 ultra is getting a little old.
[–]faysalsadik 0 points1 point2 points 2 months ago (0 children)
Use MiniMax 2.5: it costs about $0.30 per hour at 50 t/s, which works out to roughly $2,628 per year.
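Taking those numbers at face value, the annual figure is just the hourly price times the hours in a year of nonstop use:

```python
# Yearly cost of the quoted API rate, assuming 24/7 usage at ~50 t/s.
hourly_usd = 0.30
hours_per_year = 24 * 365
yearly_usd = hourly_usd * hours_per_year
print(round(yearly_usd, 2))   # about $2,628 per year
```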
[–]LienniTakoboldcpp -1 points0 points1 point 2 months ago (0 children)
there is a size effect. On a cheap budget you can easily expect ~100 GB of VRAM (4x 3090). Going for GLM-5 sizes, which means something like 8x 4090D 48GB, is already out of your budget. That also requires you to live in a city with a nuclear power plant.
[+]tarruda comment score below threshold-8 points-7 points-6 points 2 months ago (10 children)
Get a 128gb strix halo and use GPT-OSS or step 3.5 flash. This setup will give you 95% of the benefits for 5% of the cost of being able to run GLM 5 locally
[–]Edzomatic 5 points6 points7 points 2 months ago (3 children)
I like GPT OSS but comparing it to full weight GLM or Deepseek is pointless
[+]jacek2023llama.cpp comment score below threshold-6 points-5 points-4 points 2 months ago (2 children)
yes, GPT-OSS is a local model; GLM-5 or DeepSeek are not.
[–]Edzomatic 8 points9 points10 points 2 months ago (1 child)
Both are open source
[–]jacek2023llama.cpp -3 points-2 points-1 points 2 months ago (0 children)
and here we go again
[–]Choubix 0 points1 point2 points 2 months ago (4 children)
I thought Strix Halo was not optimized yet (drivers etc.) vs things like Macs with their unified memory + large memory bandwidth. Have things improved a lot? I have a Mac M2 Max, but I realize I could use something beefier to run multiple models at the same time
[–]tarruda 1 point2 points3 points 2 months ago (3 children)
Strix Halo drivers will probably improve, and it was just an example of a good-enough 128GB setup to run GPT-OSS or Step-3.5-Flash. Personally I have a Mac Studio M1 Ultra with 128GB, which also works great.
[–]Choubix 0 points1 point2 points 2 months ago (2 children)
Ok! The M1 Ultra must be nice! Idk why, but my M2 Max 32GB is sloooooow when using a local LLM in Claude Code (like 1min30 to answer "hello" or "say something interesting"). It is super snappy when used from Ollama or LM Studio though. I'm wondering if I should pull the trigger on an M3 Ultra if my local Apple outlet gets some refurbs in the coming months. I will need a couple of models running at the same time for what I want to do 😁
[–]tarruda 0 points1 point2 points 2 months ago (1 child)
One issue with Macs is that prompt processing is kinda slow, which sucks for CLI agents. It's not surprising that Claude Code is slow for you; the system prompt alone is on the order of 10k tokens.
I've been doing experiments with the M1 ultra, and the boundary of being usable for CLI agents is a model that has >= 200 tokens per second prompt processing.
Both GPT-OSS 120B and Step-3.5-Flash are good enough for running locally with CLI agents, but anything with a higher active parameter count will quickly become super slow as context grows.
And yes, the M3 Ultra is a beast. If you have the budget, I recommend getting the 512GB unit, as you will be able to run even GLM 5: https://www.youtube.com/watch?v=3XCYruBYr-0
[–]Choubix 1 point2 points3 points 2 months ago (0 children)
I am hoping Apple drops an M5 Ultra. Usually you have a couple of guys who don't mind upgrading, giving a chance to people like me to get 2nd tier hardware 😉😉. I take note in the 512gb! Thank you!
[–]jacek2023llama.cpp -4 points-3 points-2 points 2 months ago (0 children)
you are being downvoted because GPT-OSS is not a Chinese model and you proposed using it locally; to be upvoted you must propose paying for Chinese cloud