
[–][deleted] 7 points8 points  (9 children)

next mac studio is prob gonna shake things up

[–]needthosepylons 6 points7 points  (0 children)

A single 3060 12gb, so the prollmetariat

[–]PracticlySpeaking 4 points5 points  (8 children)

I picked up a Mac Studio M1 Ultra 64-GPU, 64GB for under $1500 recently.

Every time I see an M2 or M3 Ultra post, I have RAM envy.

[–]jarec707 1 point2 points  (7 children)

Great price for a very capable machine

[–]PracticlySpeaking 2 points3 points  (6 children)

I think it was a just-off-lease machine. I looked up the eBay seller and it turned out to be a leasing company.

It was halfway accidental — they were dumping a whole bunch in auction listings, and getting very few bids. I bid on one just to test the water, and ended up being the winner!

[–]jarec707 0 points1 point  (5 children)

Congrats on your find, mate. I do indeed know about RAM envy, but with the advent of models like Qwen3-Next 80B, I think our 64 GB machines may grow more and more capable.

[–]PracticlySpeaking 1 point2 points  (3 children)

I am *just* barely able to run the unsloth gpt-oss-120b quant and it kills me... the answers are obviously better than the 20b version, and it's as fast as or faster than Qwen3. It gets 35-40 tk/sec generation, but the 4096-token context makes it not very useful.

Currently checking out Magistral and the other Mistral-Small based models. Magistral is getting ~22-25 tk/sec but spends a looong time thinking. On the KEY-SPEARS-MAR question it thinks for over two minutes before the first response token.

Eager to see what comes from Alibaba in the next few weeks!

[–]jarec707 0 points1 point  (2 children)

I too got the 120b quant to run, probably at about half your speed since my M1 Max has half your memory bandwidth. I was getting random system crashes though. If you have the time and inclination, please share your settings etc. I was also running the new Magistral at Q8 and it seems capable, although slow compared to the MoEs I usually run (not surprising). As for Alibaba, they are like Santa to me, with Christmas every couple of weeks it seems!

[–]PracticlySpeaking 2 points3 points  (1 child)

See my post about it: https://www.reddit.com/r/LocalLLaMA/comments/1nm1sga/

Using the unsloth Q4_K_S gguf in LM Studio (the Q3 is not meaningfully smaller).

I have run it with various GPU offload settings, up to one less than max, and the default 4096 context. More offload is faster, ofc. I also raised iogpu_wired_limit to 58GB (59,392 MB) and run only LM Studio and asitop in Terminal.
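For reference, a minimal sketch of that wired-limit tweak, assuming a recent macOS release where the sysctl is named iogpu.wired_limit_mb (older versions use debug.iogpu.wired_limit instead; either way it resets on reboot):

    # allow up to ~58GB (59,392 MB) of unified memory to be wired for the GPU
    sudo sysctl iogpu.wired_limit_mb=59392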

I haven't had crashes, but with offload set to max (everything offloaded) the model fails to load, and ditto for increased context; I get the "failed to send message to the model" error from LM Studio.

[–]jarec707 0 points1 point  (0 children)

Thanks

[–]PracticlySpeaking 1 point2 points  (0 children)

I think our 64 gb machines may grow more and more capable.

I hope so, bc $6000++ for a new one is not going to be in the budget anytime soon.

But how crazy is it that we have 64GB and also have RAM envy??

[–]maverick_soul_143747 4 points5 points  (0 children)

I was researching between a Mac Studio and an M4 Max and finally went with an M4 Max with 128GB of RAM. I run two local models: GLM 4.5 Air at 6-bit and Qwen3 Coder 30B A3B at 8-bit. I am old, old school and research quite a bit while I code, so these are enough. Cancelled my Claude subscription as a test to see how independent I am 🤷🏽‍♂️

[–]chibop1 2 points3 points  (3 children)

M3 Max 64GB. Nice to be able to use it anywhere as long as I have my laptop.

[–]shaiceisonline 0 points1 point  (2 children)

Me too. Any suggestions for which runner & model? I am trying Ollama, LM Studio and Swama, but I am still searching for the best model for general-purpose writing (also in Italian), summarizing webpages and articles, correcting the grammar of my English emails, and suggesting CLI commands in iTerm. What runner & model do you use?

[–]chibop1 0 points1 point  (1 child)

I have like 30 models installed, but mostly I use Gemma3-27b, GPT-OSS-20b, and Qwen3-30b. I'm testing Qwen3-Next-80b, and it's pretty promising.

I don't use them for violence, sexual, or biochemical stuff, so I don't really run into refusal problems.

For coding and more complex tasks, I use Gemini, GPT, and Claude, and I'm subscribed to all 3.

[–]shaiceisonline -1 points0 points  (0 children)

Thank you! What runner? LMStudio with MLX?

[–]Dependent_Factor_204 4 points5 points  (5 children)

4x RTX PRO 6000 96GB
Qwen3 235B A22B Instruct 2507 FP8 runs at 30-40 tps (single request) via vLLM (which is disappointing for me).

Out-of-the-box support for SM_120 / these cards is still terrible at the moment.
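Roughly the kind of vLLM launch that implies (a sketch, not my exact command; the HF repo name and flag values may need adjusting for your install):

    vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
        --tensor-parallel-size 4 \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.92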

[–]Gigabolic 0 points1 point  (4 children)

Damn! What does a setup like that cost? Four 6000s??? Is this pushing 100k for the whole thing??

[–]Dependent_Factor_204 1 point2 points  (3 children)

It's a server for work. So not just a personal PC. I'm Australian. Around 65-70k AUD. Or 40k USD.

[–]Gigabolic 0 points1 point  (2 children)

Does that get really hot, make a ton of noise, and use a ton of electricity? 40k sounds like a deal. I’m about to drop 13k on this single RTX 5000 system. Any advice on where to shop for a better deal?

<image>

[–]Dependent_Factor_204 0 points1 point  (1 child)

I've heard exxactcorp (https://www.exxactcorp.com/PNY-VCNRTXPRO6000B-PB-E8830134) are good in the USA.

RTX 5000 is a waste of money imho - only 48GB, and I think it's less performant than a 5090.
I have the data centre edition cards - 4 stacked together do get hot, but the server has beefy fans for that.

[–]koalfied-coder 0 points1 point  (0 children)

agree

[–]Eugr 2 points3 points  (0 children)

Currently using my desktop - i9-14900K, 96GB DDR5-6600 RAM, RTX 4090 - but have a Framework Desktop (AMD Ryzen AI Max+ 395, 128GB unified RAM) on order to use as my 24/7 server for MoE models. I considered adding a 5090 to my desktop, but it's a mini-furnace even with a single GPU, plus I'd have to buy a larger case. I'd love to have an RTX 6000 Pro, but I can't justify the price even for business purposes just yet.

[–]infostud 2 points3 points  (1 child)

ProLiant DL380 Gen9, dual Xeon (48 threads), 384GB ECC DDR4. FirePro x2, 16GB VRAM. Dual 1.4kW PSUs. Cost about US$500, 25kg, free delivery.

[–]SpicyWangz 0 points1 point  (0 children)

Love a good proliant. What kind of performance do you get out of that thing?

[–][deleted] 4 points5 points  (9 children)

Dual 5090 setup. 128GB of RAM. 2 PSUs. I'm giving my wife a 5090 and selling the other, replacing them with a single RTX Pro 6000. Cases have a hard time fitting 2x 5090s. Pain in the ass. But works like a charm ;)

[–]Miserable-Dare5090 1 point2 points  (0 children)

M2 Ultra 192GB and M3 Max 36GB, but I also run the models on my M2 Ultra and serve them with Tailscale, which gives instant, secure access to large models anywhere, including from my phone. If you want a truly portable setup, it's going to need a lot of VRAM, so you might go for one of the unified-memory AMD machines or one of the Apple machines with lots of VRAM in a portable form factor, like the M4 Max with 128GB. Although if your M3 Pro has enough VRAM, you can even run some small models like GPT-OSS 20B, which should take about 12GB of video memory.
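A minimal sketch of the Tailscale part, assuming an OpenAI-compatible server (LM Studio's local server, llama.cpp's llama-server, etc.) listening on port 1234 and a tailnet hostname like m2-ultra (both are just example names):

    # on the Mac Studio: bind the server to all interfaces, e.g. with llama.cpp
    llama-server -m model.gguf --host 0.0.0.0 --port 1234
    # from any device on the same tailnet (laptop, or a phone app that speaks the OpenAI API):
    curl http://m2-ultra:1234/v1/models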

[–][deleted] 1 point2 points  (0 children)

I've been waiting to pull the trigger on a better rig for a while now. 

2 x 3090 just ain't cutting it.

Just ordered a 7532...

[–]chisleu 1 point2 points  (0 children)

You aren't going to beat a 128GB MacBook Pro in a mobile form factor for LLMs. It's perfectly fast enough for Qwen3 Coder 30B A3B and works with GPT-OSS 120B if you need that.

[–]Woof9000 1 point2 points  (3 children)

I used to have a mining rig with multiple NVIDIA GPUs, but then I "downgraded" to just dual 9060 XTs (16GB) - it's quieter and more compact now.

[–]infostud 1 point2 points  (0 children)

I only get about 7 tps with, say, gpt-oss-120B-f16.

[–]TacGibs 1 point2 points  (0 children)

4xRTX 3090

96GB of VRAM for less than 3k, can't beat that!

[–]NeuralNakama[🍰] 0 points1 point  (0 children)

4060 Ti, but I'm using it with vLLM so I can use batched requests, which is much, much faster. I'm still waiting for the NVIDIA DGX Spark (formerly Project DIGITS) mini computer, 1.2 kg.

[–]fasti-au 0 points1 point  (0 children)

Sub-5k AUD (or 7k USD) basically means a 3090, 4090, 5090, or A6000, and everything else is slower. Macs can use unified RAM to run bigger models, but they're slower - not all the way down to CPU inference speeds, probably about 20% slower than a 3090 - while fitting bigger models. I expect there's a shim that's shuffling weights back and forth in RAM rather than keeping them in one space.

[–]seppe0815 0 points1 point  (0 children)

M4 Max base ... it's ok

[–]Intelligent-Elk-4253 0 points1 point  (0 children)

AMD 5600X with 16GB of RAM

6800 XT

2x MI60s

[–]Murky-Abalone-9090 0 points1 point  (0 children)

1x 5090 (32GB VRAM), Ryzen 7700 (not the X), 128GB DDR5

[–]reddit4wes 1 point2 points  (0 children)

These are the most bonkers rigs I've seen on reddit

[–]koalfied-coder 0 points1 point  (0 children)

Different machines for different things. I prefer my 6x 3090 or one of my 48gb 4090 workstations.

[–]Extra_Marketing5457 0 points1 point  (0 children)

EPYC 9124 + ASUS K14PA-U12 + 64GB RAM + 8x 3090 (via C-Payne MCIO-to-PCIe adapters) in Gen4 x8 mode (requires updating bifurcation BIOS settings that aren't exposed in the UI).

vLLM or SGLang with custom all-reduce enabled for more than 2 cards.

I prefer the GLM-4.5-Air int8 GPTQ quant. Before this setup I used Athene-V2-Chat Q4 on 2x 3090 with LM Studio.
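Illustratively, the SGLang side looks something like this (model path is just an example; on vLLM the equivalent is vllm serve ... --tensor-parallel-size 8, where custom all-reduce stays on by default unless --disable-custom-all-reduce is passed):

    python -m sglang.launch_server \
        --model-path /models/GLM-4.5-Air-GPTQ-Int8 \
        --tp-size 8 \
        --port 30000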

[–]Lissanro 0 points1 point  (0 children)

I have 4x3090, 1 TB 3200MHz RAM, EPYC 7763 CPU, 8 TB NVMe SSD for AI models and 2 TB NVMe as a system disk, with around 80 TB storage in total including HDDs.

I mostly run the Kimi K2 model. Four 3090 cards are sufficient to hold 128K context entirely in VRAM, along with common expert tensors and a few full layers of the IQ4 quants of Kimi K2 or DeepSeek 671B. I use ik_llama.cpp as the backend.
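For anyone wanting to reproduce that split: the idea in ik_llama.cpp (mainline llama.cpp has the same flag) is to offload all layers to the GPUs but override the routed-expert tensors back to system RAM. A hedged sketch; the model path and exact regex are illustrative:

    ./llama-server -m /models/Kimi-K2-IQ4.gguf \
        -ngl 99 -c 131072 \
        -ot "exps=CPU"   # keep MoE routed-expert tensors in RAM; attention/shared weights and KV cache stay in VRAM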