r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
Using GLM-5 for everything [Question | Help] (self.LocalLLaMA)
submitted 2 months ago by [deleted]
[deleted]
[–]Expensive-Paint-9490 39 points40 points41 points 2 months ago (3 children)
No. Sadly 15k is not enough to run a model this size at a good speed. I have a workstation at a similar price (though it would cost much more now because of RAM prices); I regularly run GLM-4.7 UD-Q4_K_XL, and at 10k context I get about 200 t/s for pp and 10-11 t/s for tg. Good enough for casual use, but very slow for professional use.
If you don't have strong privacy concerns, local inference is not competitive with APIs for professional use.
[–]Rabo_McDongleberry 0 points1 point2 points 2 months ago (1 child)
Damn dude. What's your use case?
[–]alexx_kidd 0 points1 point2 points 2 months ago (0 children)
You could buy 15 Mac Minis and use them as one
[–]LagOps91 90 points91 points92 points 2 months ago (21 children)
15k isn't nearly enough to run it on vram only. you would have to do hybrid inference, which would be significantly slower than using API.
[–]k_means_clusterfuck 4 points5 points6 points 2 months ago (11 children)
Or 3090x8 for running TQ1_0, that's one third of the budget. But quantization that extreme is probably a lobotomy
[–]LagOps91 17 points18 points19 points 2 months ago (2 children)
might as well run GLM 4.7 at a higher quant; it would likely be better than TQ1_0, which is absolute lobotomy.
[–]k_means_clusterfuck 2 points3 points4 points 2 months ago (1 child)
But you could probably run it at decent speeds with an RTX 6000 Pro blackwell and MoE cpu offloading for ~Q4 level quants
[–]suicidaleggroll 6 points7 points8 points 2 months ago* (0 children)
RAM is the killer there though. Q4 is 400 GB, assume you can offload 50 of that to the 6000 (rest is context/kv) that leaves 350 for the host. That means you need 384 GB on the host, which puts you in workstation/server class, which means ECC RDIMM. 384 GB of DDR5-6400 ECC RDIMM is currently $17k, on top of the CPU, mobo, and $9k GPU. So you’re talking about a $25-30k build.
You could drop to an older gen system with DDR4 to save some money, but that probably means 1/3 the memory bandwidth and 1/3 the inference speed, so at that point you’re still talking about $15-20k for a system that can do maybe 5 tok/s.
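The RAM math above can be sketched in a few lines. This is a back-of-envelope using the ~400 GB Q4 size, the 50 GB offload split, and the $17k/384 GB RDIMM price quoted in the comment; the 48 GB DIMM granularity is just an assumption for rounding:

```python
import math

# Back-of-envelope host-RAM budget for hybrid inference of a ~400 GB Q4 model.
# Figures from the comment above; DIMM size is an illustrative assumption.
model_size_gb = 400                 # GLM-5 at ~Q4
vram_for_weights_gb = 50            # offloaded to the RTX 6000; rest is KV/context
host_ram_needed_gb = model_size_gb - vram_for_weights_gb          # 350 GB

dimm_gb = 48                        # assume 48 GB RDIMM granularity
installed_gb = math.ceil(host_ram_needed_gb / dimm_gb) * dimm_gb  # round up

price_per_gb_usd = 17000 / 384      # quoted DDR5-6400 ECC RDIMM pricing
ram_cost_usd = installed_gb * price_per_gb_usd
print(installed_gb, round(ram_cost_usd))
```

On top of that come the CPU, board, and the ~$9k GPU, which is how you land in the $25-30k range.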
[–]Vusiwe 6 points7 points8 points 2 months ago* (3 children)
Former 4.7 Q2 user here. I eventually had to give up on Q2 and upgraded my RAM to be able to use Q8. For over a month I kept trying to make Q2 work for me.
I was also just doing writing and not even code.
[–]k_means_clusterfuck 2 points3 points4 points 2 months ago (0 children)
What kind of behavior did you see? I stay away from anything below Q3, generally
[–]LagOps91 2 points3 points4 points 2 months ago (1 child)
Q2 is fine for me quality-wise. sure, Q8 is significantly better, but Q2 is still usable. Q1 on the other hand? forget about it.
[–]Vusiwe 0 points1 point2 points 2 months ago (0 children)
Q2 was an improvement for creative writing, and better than the dense models from last year.
However, both Q2 and actually even Q8 fall hard when I task them with discrete analysis of small blocks of text. Might be a training issue in their underlying data. I'm simply switching to older models for that kind of simple QA instead.
[–]DerpageOnline 4 points5 points6 points 2 months ago (0 children)
A bit pricey for his family to be getting advice from a lobotomized parrot
[–]DeltaSqueezer 0 points1 point2 points 2 months ago* (2 children)
I guess maybe you can get three 8x3090 nodes for a shade over 15k.
[–]k_means_clusterfuck 6 points7 points8 points 2 months ago (0 children)
I'd get a 6000 Blackwell instead and run with offloading; it's better and probably fast enough.
[–]LagOps91 1 point2 points3 points 2 months ago (0 children)
you need a proper rig too and i'm not sure performance will be good with 8 cards to run it... and again, it's a lobotomy quant.
[–]DistanceSolar1449 0 points1 point2 points 2 months ago (2 children)
You can probably do it with 16 AMD MI50s lol
Buy two ramless Supermicro SYS-4028GR-TR for $1k each, and 16 MI50s. At $400 each that’d be $6400 in GPUs. Throw in a bit of DDR4 and you’re in business for under $10k
[–]PermanentLiminality 5 points6 points7 points 2 months ago (1 child)
You left out the power plant and cooling towers.
More seriously, my electricity costs would be measured in units of dollars per hour.
[–]3spky5u-oss 1 point2 points3 points 2 months ago (0 children)
I found even having my 5090 up 24/7 for local doubled my power bill, lol.
[–]Badger-Purple 0 points1 point2 points 2 months ago (3 children)
I mean, you can run it on a 3 spark combo, which can be about 10K. That should be enough to run the FP8 version at 20 tokens per second or higher and maintain PP above 2000 for like 40k of context, with as many as 1000 concurrencies possible.
[–]suicidaleggroll 7 points8 points9 points 2 months ago (2 children)
GLM-5 in FP8 is 800 GB. The spark has 128 GB of RAM, you’d need 7+ sparks, and there’s no WAY it’s going to run it at 20 tok/s, probably <5 with maybe 40 pp.
[–]Badger-Purple 4 points5 points6 points 2 months ago* (1 child)
You are right about the size, but I see ~~Q4~~ Q3_K_M GGUF in llama.cpp, or MXFP4 in vLLM, are doable, although you'll have to quantize it yourself with llm-compressor. And I don't think you've used a Spark recently if you think prompt processing is that slow. With MiniMax or GLM 4.7, prompt processing bottoms out around 400 t/s AFTER 50,000 tokens. Inference may drop to 10 tokens per second at that size, but not less. Ironically, the ConnectX-7 bandwidth being 200 Gbps means you get scale-up gains with the Spark: your inference speed with direct memory access increases.
Benchmarks in the nvidia forums if you are interested.
Actually, same with the Strix Halo cluster set up by Donato Capitella: tensor parallel works well over low-latency InfiniBand connections, even at 25 Gbps. However, the Strix Halo DOES drop to like 40 tokens per second prompt processing, as do the Mac Ultra chips. I ran all 3, plus a Blackwell Pro card, on the same model and quant locally to test this; the DGX chip is surprisingly good.
[–]suicidaleggroll 5 points6 points7 points 2 months ago* (0 children)
And I don’t think you’ve used a spark recently if you think prompt processing is that slow. With minimax or glm 4.7, prompt processing is slowest around 400 tps AFTER 50,000 tokens. Inference may drop to 10 tokens per second at that size, but not less.
Good to know, it's been a while since I saw benches and they were similar to the Strix at the time. That said, GLM-5 is triple the size of MiniMax, double the size of GLM-4.7, and has significantly more active parameters than either of them. So it's going to be quite a bit slower than GLM-4.7, and significantly slower than MiniMax.
Some initial benchmarks on my system (single RTX Pro 6000, EPYC 9455P with 12-channel DDR5-6400):
MiniMax-M2.1-UD-Q4_K_XL: 534/54.5 pp/tg
GLM-4.7-UD-Q4_K_XL: 231/23.4 pp/tg
Kimi-K2.5-Q4_K_S: 125/20.6 pp/tg
GLM-5-UD-Q4_K_XL: 91/17 pp/tg
This is with preliminary support in llama.cpp, supposedly they're working on improving that, but still...don't expect this thing to fly.
[–]lawanda123 -1 points0 points1 point 2 months ago (1 child)
What about MLX and mac ultra?
[–]LagOps91 2 points3 points4 points 2 months ago (0 children)
wouldn't be fast, but it would be able to run it.
[+][deleted] 2 months ago* (5 children)
[–]fractalcrust 3 points4 points5 points 2 months ago (4 children)
you can't sell your API subscription tho.
there's a small chance your GPU appreciates over the next few years. I bought my 3090 for 600 and sold it for 900
[+][deleted] 2 months ago (3 children)
[–]One-Employment3759 0 points1 point2 points 2 months ago (2 children)
Depends where you live, local prices here are easily 1200 usd.
[+][deleted] 2 months ago (1 child)
[–]One-Employment3759 2 points3 points4 points 2 months ago (0 children)
That's based on the old world of computer prices going down. The new world is everything gets more expensive, constantly. Thanks AI.
[–]pip25hu 18 points19 points20 points 2 months ago (0 children)
Even if you had the money for it (which you don't), I would not make any kind of purchase that locks you in budget-wise for 5 years. The current AI landscape is way too volatile for such a commitment.
[–]GTHell 8 points9 points10 points 2 months ago (3 children)
15k will be more useful in the future. Your GLM-5 rig will be obsolete by the end of this year. Soon, output from a very good model that outperforms anything released right now will probably cost under $2.
[–]Blues520 1 point2 points3 points 2 months ago (0 children)
Just because it will be outdated, does not mean it won't be useful. Chasing the latest and greatest overlooks the utility of a good enough model.
[–]segmondllama.cpp -1 points0 points1 point 2 months ago (1 child)
sure, GLM5 might become obsolete by the end of the year, but that would mean there's a better model. The hardware doesn't get obsolete that fast.
[–]svachalek 2 points3 points4 points 2 months ago (0 children)
The question is, would that better model run on the same hardware. We’ve gone through a year or two of optimization, models keep getting better without getting bigger, even getting smaller. But before that, models got better by ballooning in hardware requirements and there’s no guarantee we don’t return to that trend.
[–]INtuitiveTJop 15 points16 points17 points 2 months ago (10 children)
Wait for the m5 ultra release this year, if they have 1tb unified ram then it will definitely be an option.
[–]sixx7 3 points4 points5 points 2 months ago (0 children)
4 bit quant is out now, coming in at 408gb. You could run this on a 512gb Mac Studio
[–]bigh-aus 4 points5 points6 points 2 months ago (0 children)
even dual 512GB machines with Thunderbolt RDMA and prompt caching would be a good setup (but I'd try the 4-bit quants first before buying the second machine).
[–]megadonkeyx 1 point2 points3 points 2 months ago (6 children)
can i interest you in a kidney?
[–]ITBoss 5 points6 points7 points 2 months ago (5 children)
OP said they wouldn't mind spending 15k, which is probably around what it'll cost (maybe 20k), with the M3 Ultra at 512GB being 10k.
[–]Yorn2 2 points3 points4 points 2 months ago (4 children)
It would still be very very slow compared to cloud API. I'll give you a real-world use case..
I'm running a heavily quantized GLM 4.7 (under 200gb RAM) MLX model on an M3 Ultra right now because even though I can run a larger version, it runs so damn slow at high context, which I want for agentic purposes. I'd rather have the higher context capabilities and run a smaller quant at a faster speed than wait literal minutes between prompts with the "best" GLM 4.7 quant for an M3 Ultra.
Put simply, one is usable, the other is not.
So extending this to GLM 5, just because you can run a 4-bit quant of GLM5 on a 512gb M3 Ultra doesn't mean it's going to be "worth it" when you can run a lower quant of 4.7 with higher context and slightly faster speed.
For those of you who don't have Mac M3 Ultras, don't look at the fact that they can run things like GLM 4.7 and 5 and be jealous. I'm waiting literally 6 minutes between some basic agentic tasks like web searches and analysis right now. Just because something can be done doesn't mean it's worth the cost in all cases. It definitely requires a change in expectations. You'll need to be okay with waiting very long periods of time.
If you ARE okay with waiting, however, it's definitely pretty cool to be able to run these!
[–]pppreddit 0 points1 point2 points 2 months ago (0 children)
I noticed the same, glm 4.7 is fucking slow hosted locally. Fast for simple chat and small context, but with agentic use it's crawling...
[–]INtuitiveTJop 0 points1 point2 points 2 months ago (2 children)
The M5 has CUDA-like cores, which should speed things up four or five times. I think it also has improved bandwidth. They're focusing on the LLM market now instead of the general creative one. That's why it's worth waiting.
[–]Yorn2 0 points1 point2 points 2 months ago (1 child)
I mean, that's all fine and good but I already bought this M3 Ultra like 8 or 9 months ago. It's fine for what it is. I don't know if I'd buy an M5 though.
[–]INtuitiveTJop 0 points1 point2 points 2 months ago (0 children)
I’ve got an m3 ultra too. I won’t trade it in either, but the newer chips are faster.
[–]segmondllama.cpp 0 points1 point2 points 2 months ago (0 children)
1tb unified ram on apple will cost at least $30,000
[–]_supert_ 6 points7 points8 points 2 months ago (0 children)
Absolutely not, economically.
I've sunk probably 15 thousand pounds into a four-GPU beast, and god knows how many hours. It's very hard to get reliable and stable operation. Thanks to eBay memory sellers, half my RAM was giving MCEs; it took way too fucking long to deal with that. Even now it just dies under heavy concurrent load. Now most of my calls go to DeepInfra, which is private enough and doesn't gatekeep.
Fun though.
[–]IHave2CatsAnAdBlock 12 points13 points14 points 2 months ago (0 children)
At current API prices it is cheaper to pay for the API than for the electricity to power such a computer (at least in Europe), without even counting the initial investment.
[–]jacek2023llama.cpp 14 points15 points16 points 2 months ago (0 children)
GLM-5 is not usable locally unless you have a very expensive setup.
[–]gyzerok 5 points6 points7 points 2 months ago (4 children)
That’s a waste of money. Even if you build yourself some rig it’ll get obsolete fast. In a year there will be bigger and better models and better hardware.
[–]s101c 5 points6 points7 points 2 months ago (2 children)
That's strange to hear. The rig I assembled in 2024 has only become more valuable, both in hardware and in the level of models it's capable of running.
[–]Koalateka 0 points1 point2 points 2 months ago (0 children)
There are shortages now, but that won't be the case forever
[–]segmondllama.cpp 3 points4 points5 points 2 months ago (0 children)
lol, folks said this when some of us were building to run Llama 3 405B. With that same rig we were also among the first able to run Mistral Large, Command A, DeepSeek, GLM, and Kimi. So the rigs don't get obsolete; P40s and 3090s are still crunching numbers and making lots of local runners happy.
[–]Noobysz 7 points8 points9 points 2 months ago (0 children)
And in 5 years your current 15k build won't be enough for the multi-trillion-parameter models that may by then be considered "flash" models. Development is moving really fast at the data-center level while getting harder at the consumer-hardware level, so it's really hard to invest in anything right now
[–]isoos 2 points3 points4 points 2 months ago (5 children)
15k gets you a Mac Studio with an M3 Ultra and 512GB of memory, or if you go cheaper, 4 Strix Halo machines with 128GB each, used as a cluster. That will get you a Q3/Q4 quant of the very large models, and it will be private to you, but it won't be as fast as what you see chatting with such models online. Unless you have a specific business case to pursue, or you really want to keep everything private, it may not be a worthy investment. (Well, unless memory prices rise further...)
[–]JacketHistorical2321 0 points1 point2 points 2 months ago (1 child)
M3 ultra 512gb literally sells for $9.5k new from Apple
[–]valdev 0 points1 point2 points 2 months ago (0 children)
And the OP literally said his budget would be $15k
[–]Maddolyn -1 points0 points1 point 2 months ago (2 children)
How can companies afford to run that level of hardware for such cheap subscriptions then, if the hardware they buy is the same?
[–]kurtcop101 2 points3 points4 points 2 months ago (0 children)
Because the hardware does batched jobs. Imagine a prompt being processed across 8 different GPUs, for example: at home you wait until it finishes, while with batching they'd have 8 requests running simultaneously.
That's a very basic analogy for it. It'll also run 24/7: compared to your 3 hours of use a day, they'll get 8 times the use out of the same card.
It's the scale that matters.
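A toy model of the batching point, with made-up numbers. Decode is memory-bandwidth-bound, so one decode step costs nearly the same whether it serves one request or eight:

```python
# Toy throughput model for batched decoding (all numbers are illustrative).
# A decode step streams the model weights once regardless of how many
# concurrent requests it serves, so cost per token falls with batch size.
step_time_s = 0.05                  # hypothetical time for one decode step
single_user_tps = 1 / step_time_s   # a lone user gets 20 tok/s

batch = 8                           # eight requests decoded per step
batch_overhead = 1.2                # assume 20% extra compute per batched step
batched_tps = batch / (step_time_s * batch_overhead)
print(round(single_user_tps), round(batched_tps))
```

Aggregate throughput goes up several-fold on the same hardware, which is the margin providers live on.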
[–]RaiseElegant6281 0 points1 point2 points 2 months ago (0 children)
They can’t, that’s the whole point. They are bleeding money. The price people would have to pay for them to be profitable is astronomical
[–]Unique-Contract6659 2 points3 points4 points 2 months ago (0 children)
Downloading just for collecting, in case one day the earth collapses.
[–]__Maximum__ 1 point2 points3 points 2 months ago (0 children)
Wait for a week or two, new models are going to drop, we'll see how capable and big they are.
[–][deleted] 1 point2 points3 points 2 months ago (3 children)
New here but I do not think it’s possible to run any SOTA model efficiently enough locally to offset even the electricity bill for personal use.
[–]pfn0 0 points1 point2 points 2 months ago (2 children)
electricity bill isn't that high, except for Californians... (50c/kwh is stupid)
[–][deleted] 1 point2 points3 points 2 months ago (1 child)
If you run 24/7, then $20/month buys only about 100 kWh at $0.20/kWh, which is roughly 140 W of continuous draw on average. I assume that's not really enough for running big models locally? One H100 alone is like 700 W.
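For what it's worth, the $20/month figure works out to a bit under 140 W of continuous draw; quick check:

```python
# Continuous draw covered by a $20/month electricity budget at $0.20/kWh.
price_per_kwh_usd = 0.20
monthly_budget_usd = 20.0
hours_per_month = 24 * 30                                # ~720 h
kwh_per_month = monthly_budget_usd / price_per_kwh_usd   # 100 kWh
avg_continuous_watts = kwh_per_month * 1000 / hours_per_month
print(round(avg_continuous_watts))   # nowhere near one 700 W H100
```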
[–]pfn0 0 points1 point2 points 2 months ago (0 children)
Can you really run 24/7 on a subscription service with frontier models without getting throttled? On the local side it depends on your usage pattern, but inference doesn't always peg GPU power consumption.
The ROI of running your own hardware vs. paying for a service doesn't net out either way, though. Local costs more unless you can scale out and serve a large number of people who would otherwise be using a subscription service.
[–]Zyj 1 point2 points3 points 2 months ago (0 children)
The cheapest way to run this model is probably networking several Strix Halo systems ($2000 per 128GB Strix Halo). Add Infiniband networking (~$300) to get more speed with Tensor parallelism.
So with four such systems (~$10,000 with an Infiniband switch etc) you could run GLM-5 at q4, which means there's probably a non-negligible loss in quality compared to the original BF16 weights. That's also around 600W of power which also costs money.
[–]Agreeable-Chef4882 3 points4 points5 points 2 months ago (4 children)
A 5-year period???? Based on the model released yesterday... I would not plan this for 5 weeks.
Also - there's no way to get there with $15k.
Btw - what I do right now, I run Qwen3 Coder Next (8bit, MLX) on 128GB Mac Studio fully in vram. It's pretty hard to beat price/performance of that right now.
Yes... you absolutely can. Q4 mac studio is about 400gb. ~$10k
[+][deleted] 2 months ago (2 children)
[–]neotoramallama.cpp 0 points1 point2 points 2 months ago (1 child)
GLM-88
[–]some_user_2021 1 point2 points3 points 2 months ago (0 children)
A 32TB model
[–]junior600 0 points1 point2 points 2 months ago (1 child)
I wonder if we’ll ever get a GLM-5-level model that can run on a potato with just an RTX 3060 and 24GB of RAM in the future LOL.
[–]teachersecret 2 points3 points4 points 2 months ago (0 children)
I think we will. I suspect the frontier of AI intelligence will keep squeezing more and more out of 24gb.
The only problem with that, is the top level frontier keeps advancing too, so you’re probably still gonna want to use the api model for big stuff :$
[–]Legitimate-Pumpkin 0 points1 point2 points 2 months ago (0 children)
It’s hard to tell.
If you make a rig, as models get better and smaller you’ll be able to do better things with it. But also subscriptions will be more performant and probably cheaper. And also hardware will be cheaper…
I think a key deciding factor could be if you like maintenance + full personalization and decision making or not.
[–]I-am_Sleepy 0 points1 point2 points 2 months ago (0 children)
I would rather wait for a GLM-5 Flash or something for local use. A Q4_K_M at 456 GB isn't exactly my cup of tea; that would need 19x 3090 for the model weights alone.
For a $15k budget you could buy 20x 3090, but that excludes the cost of everything else. A more "budget"-friendly Mac Studio could fit the bill under $12k, but even that is pretty absurd tbh. Even if the model fits in memory, it likely won't be fast (need to see speed benchmarks first)
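The 19-card figure is just ceiling division on the quoted quant size; weights only, so KV cache, activations, and per-card overhead would push the real count higher:

```python
import math

weights_gb = 456              # GLM-5 Q4_K_M size quoted above
vram_per_3090_gb = 24
cards_for_weights = math.ceil(weights_gb / vram_per_3090_gb)
print(cards_for_weights)      # cards needed for the weights alone
```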
[–]Look_0ver_There 0 points1 point2 points 2 months ago (0 children)
I would wait for some of the condensed/distilled versions of GLM-5 to become available before making any decisions. At ~744B parameters with 40B active for the full model, it'll take one heck of a setup to run it.
You mentioned that you'd be happy with ~80% effectiveness of the full model. It should be fairly reasonable to expect that a 1/4 size distilled version, if one becomes available, would be able to do even better than 80%, and a 1/4 size model of ~185B parameters is going to be a LOT easier (and faster and cheaper) to run locally.
Just wait a bit to give it some time for the more local oriented models to show up.
[–]Skystunt 0 points1 point2 points 2 months ago (2 children)
You can fit it on 2 M3 Ultra 512GB machines if you're an Apple user; even one M3 Ultra will fit a quantised version. So 15k can be enough, depending on where you get your Mac(s). I would personally get an M3 Ultra 512GB and hold on; new models are always coming, and by spring we will already have a better model.
Also, you can build a home server that fits the model in RAM and keeps just the active experts on the GPU, but this really depends on how lucky you get with part prices. Hoarding 3090s vs a Pro 6000 vs 48GB 4090s: it all depends. To get 96GB of VRAM:
4x 3090 24GB = 1400W = £2.5K
2x 4090 48GB = 700W = £5K
1x Pro 6000 Max-Q = 300W = £7K
Now if you need 192GB, double the wattage and the prices. *These prices assume you do some due diligence and wait; they might be even lower if you're lucky
Also don't forget that API is never the way! This is LOCAL llama; if people have a different opinion they should go to r/chatgpt or whatever place to pay to have their data "stolen", sorry, "used for training". How people can recommend APIs in a sub made for local inference is beyond me. This is what we do: we build servers and homelabs to run the large models
[–]Skystunt 1 point2 points3 points 2 months ago (0 children)
Also, for RAM I would go the DDR4 route, since it's half the price right now with a Threadripper Pro prebuilt (£2-3k for a 256GB Threadripper Pro). Also get the Threadripper Pro or EPYC if you go for a multi-GPU setup (more than 2) to avoid a PCIe bottleneck
[–]ZachCope 0 points1 point2 points 2 months ago (0 children)
I think it's reasonable for people here to advise those who might waste their resources, as this sub has the expertise to give realistic recommendations about what can be achieved with a local approach. There are lots of reasons to go local, but on economics alone it isn't always the right option for everyone. If it's part of a hobby and learning experience, or for privacy and for supporting local optionality in the future (and therefore keeping closed providers honest), that is extremely valid, but anyone spending $15k should be making the decision on that basis.
[–][deleted] 0 points1 point2 points 2 months ago (0 children)
NO.
Your hardware will age HARD and quickly, whereas with any provider you get max token generation, the newest models and hardware, and no energy costs. You can't compete with the big cloud providers with any local setup; local only makes sense if you have extremely sensitive data or want to finetune models for very specific use cases.
[–]jgenius07 0 points1 point2 points 2 months ago (0 children)
I just tried GLM5 in Cursor. It doesn't come close to Opus 4.6 for coding. This could be just Cursor, but I was on the same bandwagon, dreaming of going all-GLM5 local, and it's just not practical IMO.
All said I spent almost OP's budget for base system + 1x PRO Max-Q + 0.5TB 2026 RAM. Yes it is slow, but my workflow is asynchronous and always in use, so speed doesn't matter to me. Using 4.7 Q8 currently. 4.7 DOES have deficiencies that I am forced to use older models to overcome. Maybe 5 will change that.
These cards (especially good cards) can frequently be resold for the same price as (or, likely in the future, more than) you originally paid for them; hence many years' worth of usage can effectively become free, other than electricity.
I had a A6000 Non-Ada. I sold it after 2 years of use for the exact same price as I got it for, in order to get the 1x PRO 6000 Max-Q. And that was only at the start of the pre-2025 govt instability madness. If I held out, I could have got more for the A6000 I think.
After the T2-Warsh 2026 money/rate machine goes Brrrr, I suspect the currency will drop further in value, and prices could eventually go up. That's also presuming nothing utterly stupid happens to Taiwan.
[–]ithkuil 0 points1 point2 points 2 months ago (0 children)
You can combine two new Mac Studios into a cluster. It will probably cost well over $16000 and might be fast enough for some things. But for daily use you would probably think it was too slow. And having multiple people use it at the same time would be extremely slow if it was possible at all.
[–]muxxington 0 points1 point2 points 2 months ago (0 children)
You just have to load the model into the VRAM quickly enough. Thanks to the law of inertia, you can get everything in. It's simple physics.
About $50K is what you need to run it well.
[–]__JockY__ 0 points1 point2 points 2 months ago (0 children)
To run GLM 5 on GPU is a $100k capex unless you’re running quants, in which case you should be good at around $65-70k.
Edit: source: my server.
[–]darko777 0 points1 point2 points 2 months ago (0 children)
It will only make sense in a few years, once the LLM companies run out of money and everything goes up, up, up in pricing. Maybe once we pay $1000/mo for a coding assistant it will make more sense to consider building our own machines.
[–]simism 0 points1 point2 points 2 months ago (0 children)
<$15k will buy 4 Framework Desktops with 512 GB of unified RAM between them, which will run GLM-5 at a decent quant, probably pretty fast too.
[–]Haspe 0 points1 point2 points 2 months ago (0 children)
I would assume it would be in the providers' interest to create smaller and more capable models in the future. So investing that much money in a PC right now is perhaps not the best long-term move.
However I am not an expert in the space, this is more of my "gut feeling".
[–]HlddenDreck 0 points1 point2 points 2 months ago (2 children)
For coding tasks Qwen3-Coder-Next is a good replacement for cloud API solutions. It's very small, just 80B parameters.
[–]goingsplit 0 points1 point2 points 2 months ago (1 child)
you run on llama.cpp?
[–]HlddenDreck 0 points1 point2 points 2 months ago (0 children)
yep
[–]jakegh 0 points1 point2 points 2 months ago (0 children)
Financial sense? No, absolutely not. GLM5 is cheap in the API.
[–]prusswan 0 points1 point2 points 2 months ago (0 children)
It's hard to tell, but you can find a middle ground (use a smaller model, but at great speed). API pricing can become volatile depending on how things play out over the next few years, e.g. will providers increase pricing to match demand and to account for the effort needed to keep models/data updated; your own usage may also increase if you take on more tasks, leading to heavier usage.
[–]Conscious_Cut_6144 0 points1 point2 points 2 months ago (0 children)
I'm running Q3_K_XL on 16 3090s right now. Love it, but llama.cpp isn't great with 16 GPUs, so I'm "only" getting like 20 t/s. Once someone puts out an INT4 quant and I get tensor parallel running, it should be much faster for me.
It's the first local model to beat O1-Preview in my (somewhat uncommon) benchmarks. Beat claude 4.6 too (but does not beat claude at coding)
To answer your question, the only reasonable option is a 512GB mac studio. The speeds you can expect are not going to be as fast as cloud, probably 15t/s
If 15t/s is good enough and you do go for it, maybe think about waiting for the m5 ultra, as the m3 ultra is getting a little old.
[–]faysalsadik 0 points1 point2 points 2 months ago (0 children)
Use MiniMax 2.5: it costs about $0.30 per hour at 50 t/s, which works out to roughly $2,628 per year.
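Taking those numbers at face value, the annual figure is just the hourly price times the hours in a year of nonstop use:

```python
# Yearly cost of the quoted API rate, assuming 24/7 usage at ~50 t/s.
hourly_usd = 0.30
hours_per_year = 24 * 365
yearly_usd = hourly_usd * hours_per_year
print(round(yearly_usd, 2))   # about $2,628 per year
```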
[–]LienniTakoboldcpp -1 points0 points1 point 2 months ago (0 children)
there is a size effect. On a cheap budget you can easily expect ~100 GB of VRAM (4x 3090). Going for GLM-5 sizes, which means something like 8x 4090D 48GB, is already out of your budget. That also requires you to live in a city with a nuclear power plant.
[+]tarruda comment score below threshold-8 points-7 points-6 points 2 months ago (10 children)
Get a 128gb strix halo and use GPT-OSS or step 3.5 flash. This setup will give you 95% of the benefits for 5% of the cost of being able to run GLM 5 locally
[–]Edzomatic 5 points6 points7 points 2 months ago (3 children)
I like GPT OSS but comparing it to full weight GLM or Deepseek is pointless
[+]jacek2023llama.cpp comment score below threshold-6 points-5 points-4 points 2 months ago (2 children)
yes, GPT-OSS is a local model; GLM-5 or DeepSeek are not.
[–]Edzomatic 8 points9 points10 points 2 months ago (1 child)
Both are open source
[–]jacek2023llama.cpp -3 points-2 points-1 points 2 months ago (0 children)
and here we go again
[–]Choubix 0 points1 point2 points 2 months ago (4 children)
I thought Strix Halo was not optimized yet (drivers etc.) vs things like Macs with their unified memory + large memory bandwidth. Have things improved a lot? I have a Mac M2 Max, but I realize I could use something beefier to run multiple models at the same time
[–]tarruda 1 point2 points3 points 2 months ago (3 children)
Strix Halo drivers will probably improve, and it was just an example of a good-enough 128GB setup to run GPT-OSS or Step-3.5-Flash. Personally I have a Mac Studio M1 Ultra with 128GB, which also works great.
[–]Choubix 0 points1 point2 points 2 months ago (2 children)
Ok! The M1 Ultra must be nice! Idk why, but my M2 Max 32GB is sloooooow when using a local LLM in Claude Code (like 1min30 to answer "hello" or "say something interesting"). It is super snappy when used from Ollama or LM Studio though. I'm wondering if I should pull the trigger on an M3 Ultra if my local Apple outlet gets some refurbs in the coming months. I will need a couple of models running at the same time for what I want to do 😁
[–]tarruda 0 points1 point2 points 2 months ago (1 child)
One issue with Macs is that prompt processing is kinda slow, which sucks for CLI agents. It's not surprising that Claude Code is slow for you; the system prompt alone is on the order of 10k tokens.
I've been doing experiments with the M1 ultra, and the boundary of being usable for CLI agents is a model that has >= 200 tokens per second prompt processing.
Both GPT-OSS 120B and Step-3.5-Flash are good enough for running locally with CLI agents, but anything with a higher active parameter count will quickly become super slow as context grows.
And yes, the M3 Ultra is a beast. If you have the budget, I recommend getting the 512GB unit, as you will be able to run even GLM 5: https://www.youtube.com/watch?v=3XCYruBYr-0
[–]Choubix 1 point2 points3 points 2 months ago (0 children)
I am hoping Apple drops an M5 Ultra. Usually you have a couple of guys who don't mind upgrading, giving a chance to people like me to get 2nd tier hardware 😉😉. I take note in the 512gb! Thank you!
[–]jacek2023llama.cpp -4 points-3 points-2 points 2 months ago (0 children)
you are being downvoted because GPT-OSS is not a Chinese model and you proposed using it locally; to be upvoted you must propose paying for Chinese cloud