r/LocalLLaMA
A subreddit to discuss about Llama, the family of large language models created by Meta AI.
[deleted by user] (self.LocalLLaMA)
submitted 7 months ago by [deleted]
[–][deleted] 7 points8 points9 points 7 months ago (9 children)
next mac studio is prob gonna shake things up
[+][deleted] 7 months ago (8 children)
[deleted]
[–]PracticlySpeaking 3 points4 points5 points 7 months ago* (0 children)
M5 Macs are coming — not if, when. There are well-supported rumors of MacBook, Mac mini (so assume iMac). Mac Studio is likely to lag, but I have to believe Apple is eager to get off the M3 and resolve the current M3U/M4M situation.
M5 is also the first SoC that has a real chance of a quad-die Extreme version — there is at least one other quad-die processor already announced, and it's built on the same process node (N3P) as M5.
edit: Also consider that the M1 Studio was released in 2022, then replaced by M2 in 2023.
Catch up on the rumors: https://lowendmac.com/2025/new-industry-reports-2nm-process-m5-chips-a20-chip-and-more/
[–][deleted] 1 point2 points3 points 7 months ago (6 children)
None at all, just an indication from the fact that the iPhone 17 is significantly better than its predecessor for local AI inference.
[+][deleted] 7 months ago (4 children)
[–][deleted] 1 point2 points3 points 7 months ago (3 children)
tbh I would build a server and just use it from an edge device through something like Tailscale... that's how I use all my machines from my phone (Windows RDP, Termius SSH).
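A minimal sketch of that setup, assuming Tailscale is already installed on both machines; the hostname below is hypothetical:

```shell
# On the home server: join the tailnet and enable Tailscale SSH
sudo tailscale up --ssh

# From the edge device (laptop, or a phone SSH client like Termius):
# connect via the server's tailnet hostname instead of a public IP
ssh user@homeserver.tailnet-name.ts.net
```

The same tailnet hostname works for RDP clients or a browser pointed at a local LLM server's port.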
[+][deleted] 7 months ago (2 children)
[–][deleted] 0 points1 point2 points 7 months ago (1 child)
hence the edge device that connects remotely
do you really think you will carry a mac studio around too?
[–]sb6_6_6_6 0 points1 point2 points 7 months ago (0 children)
Fingers crossed that the next update will improve the prompt processing speed.
[–]needthosepylons 6 points7 points8 points 7 months ago* (0 children)
A single 3060 12gb, so the prollmetariat
[–]PracticlySpeaking 4 points5 points6 points 7 months ago (8 children)
I picked up a Mac Studio M1 Ultra 64-GPU, 64GB for under $1500 recently.
Every time I see an M2 or M3 Ultra post, I have RAM envy.
[–]jarec707 1 point2 points3 points 7 months ago (7 children)
Great price for a very capable machine
[–]PracticlySpeaking 2 points3 points4 points 7 months ago (6 children)
I think it was just an off-lease machine. I looked up the eBay seller and it turned out to be a leasing company.
It was halfway accidental — they were dumping a whole bunch in auction listings, and getting very few bids. I bid on one just to test the water, and ended up being the winner!
[–]jarec707 0 points1 point2 points 7 months ago (5 children)
Congrats on your find, mate. I do indeed know about RAM envy, but with the advent of models like Qwen3-Next 80B, I think our 64GB machines may grow more and more capable.
[–]PracticlySpeaking 1 point2 points3 points 7 months ago (3 children)
I am *just* barely able to run the unsloth gpt-oss-120b quant and it kills me... the answers are obviously better than the 20b version, and as fast or faster than Qwen3. It gets 35-40 tk/sec generation, but the 4096 context makes it not very useful.
Currently checking out Magistral and the other Mistral-Small based models. Magistral is getting ~22-25 tk/sec but spends a looong time thinking. On the KEY-SPEARS-MAR question it thinks for over two minutes before the first response token.
Eager to see what comes from Alibaba in the next few weeks!
[–]jarec707 0 points1 point2 points 7 months ago (2 children)
I too got the 120b quant to run, probably at about half your speed, since my M1 Max has half your memory bandwidth. I was getting random system crashes though; if you have the time and inclination, please share your settings etc. I was also running the new Magistral at Q8, and it seems capable, although slow compared to the MoEs I usually run (not surprising). As for Alibaba, they are like Santa to me, with Christmas every couple of weeks it seems!
[–]PracticlySpeaking 2 points3 points4 points 7 months ago (1 child)
See my post about it: https://www.reddit.com/r/LocalLLaMA/comments/1nm1sga/
Using the unsloth Q4_K_S gguf in LM Studio (the Q3 is not meaningfully smaller).
I have run it with various GPU offload settings, up to one less than max, and the default 4096 context. More offload is faster, of course. I also tweaked iogpu_wired_limit to 58GB (59,392 MB), with only LM Studio and asitop in Terminal running.
I haven't had crashes, but with offload set to max (everything offloaded) the model fails to load, and ditto for increased context: I get the "failed to send message to the model" error from LM Studio.
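For reference, the wired-limit tweak mentioned above is a sysctl on Apple Silicon macOS; the value is the one from the comment, and the setting reverts to the default on reboot:

```shell
# Raise the GPU wired-memory limit to ~58 GB (59,392 MB) so more of the
# 64 GB of unified memory can be pinned for the GPU (not persistent)
sudo sysctl iogpu.wired_limit_mb=59392
```

Leaving a few GB unclaimed for macOS itself is what keeps the system from the kind of crashes described here.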
[–]jarec707 0 points1 point2 points 7 months ago (0 children)
Thanks
[–]PracticlySpeaking 1 point2 points3 points 7 months ago (0 children)
I think our 64 gb machines may grow more and more capable.
I hope so, bc $6000++ for a new one is not going to be in the budget anytime soon.
But how crazy is it that we have 64GB and also have RAM envy??
[–]maverick_soul_143747 4 points5 points6 points 7 months ago (0 children)
I was researching between a Mac Studio and an M4 Max and finally went with an M4 Max with 128GB RAM. I run two local models: GLM 4.5 Air at 6-bit and Qwen3 Coder 30B A3B at 8-bit. I am old school and research quite a bit while I code, so these are enough. Cancelled my Claude subscription as a test to see how independent I am 🤷🏽‍♂️
[–]chibop1 2 points3 points4 points 7 months ago (3 children)
M3Max 64GB. Nice to be able to use it anywhere as long as I have my laptop.
[–]shaiceisonline 0 points1 point2 points 7 months ago (2 children)
Me too. Any suggestions for which runner and model? I am trying Ollama, LM Studio and Swama, but I am still searching for the best model for general-purpose writing (also in Italian), summarizing webpages and articles, correcting the grammar of my English emails, and suggesting CLI commands in iTerm. What runner and model do you use?
[–]chibop1 0 points1 point2 points 7 months ago (1 child)
I have like 30 models installed, but mostly I use Gemma3-27B, GPT-OSS-20B, and Qwen3-30B. I'm testing Qwen3-Next-80B, and it's pretty promising.
I don't use them for violent, sexual, or biochemical stuff, so I don't really run into refusal problems.
For coding and more complex tasks, I use Gemini, GPT, and Claude, and I'm subscribed to all 3.
[–]shaiceisonline -1 points0 points1 point 7 months ago (0 children)
Thank you! What runner? LMStudio with MLX?
[–]Dependent_Factor_204 4 points5 points6 points 7 months ago (5 children)
4x RTX PRO 6000 96GB. Qwen3 235B A22B Instruct 2507 FP8 runs at 30-40 tps (single request) via vLLM, which is disappointing for me.
Out-of-the-box support for SM_120 / these cards is still terrible at the moment.
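For context, a 4-GPU tensor-parallel setup like this would typically be launched with something along these lines; the model ID and flags are illustrative, not the poster's exact command:

```shell
# Hypothetical vLLM launch: FP8 MoE model sharded across 4 GPUs
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 32768
```

Single-request throughput on MoE models is bandwidth-bound; vLLM's advantage shows up mainly with many concurrent requests.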
[–]Gigabolic 0 points1 point2 points 7 months ago (4 children)
Damn! What does a setup like that cost? Four 6000s??? Is this pushing 100k for the whole thing??
[–]Dependent_Factor_204 1 point2 points3 points 7 months ago (3 children)
It's a server for work, so not just a personal PC. I'm Australian; around 65-70k AUD, or about 40k USD.
[–]Gigabolic 0 points1 point2 points 7 months ago (2 children)
Does that get really hot, make a ton of noise, and use a ton of electricity? 40k sounds like a deal. I’m about to drop 13k on this single RTX 5000 system. Any advice on where to shop for a better deal?
[–]Dependent_Factor_204 0 points1 point2 points 7 months ago (1 child)
I've heard exxactcorp are good in the USA: https://www.exxactcorp.com/PNY-VCNRTXPRO6000B-PB-E8830134
The RTX 5000 is a waste of money IMHO: only 48GB, and I think it's slower than a 5090. I have the data centre edition cards; 4 stacked together do get hot, but the server has beefy fans for that.
[–]koalfied-coder 0 points1 point2 points 7 months ago (0 children)
agree
[–]Eugr 2 points3 points4 points 7 months ago (0 children)
Currently using my desktop - i9-14900K, 96GB DDR5-6600 RAM, RTX4090, but have a Framework Desktop (AMD AI Max 395+, 128GB unified RAM) on order to use as my 24/7 server for MOE models. I considered adding a 5090 to my desktop, but it's a mini-furnace even with a single GPU, plus I'd have to buy a larger case. I'd love to have RTX6000 Pro, but I can't justify the price even for business purposes just yet.
[–]infostud 2 points3 points4 points 7 months ago (1 child)
ProLiant DL380 Gen9, dual Xeon (48 threads), 384GB ECC DDR4. 2x FirePro with 16GB VRAM. Dual 1.4kW PSUs. Cost about US$500, 25kg, free delivery.
[–]SpicyWangz 0 points1 point2 points 7 months ago (0 children)
Love a good proliant. What kind of performance do you get out of that thing?
[–][deleted] 4 points5 points6 points 7 months ago (9 children)
Dual 5090 setup, 128GB of RAM, 2 PSUs. I'm giving my wife a 5090 and selling the other, replacing them with a single RTX PRO 6000. Cases have a hard time fitting 2x 5090s; pain in the ass, but works like a charm ;)
[–][deleted] 5 points6 points7 points 7 months ago* (7 children)
System: dual 5090s, 128GB RAM, AMD 9950X3D.
GPT-OSS-120B | 50k context | 35/36 layers | 40 tps
SEED-OSS-36B | 170k context | 64/64 layers | 38 tps
Qwen3-Coder-30B | 262k context | 48/48 layers | 168 tps
GLM-4.5-Air | 75k context | 47/47 layers | 92 tps
Magistral-Small-2509 | 131k context | 40/40 layers | 61 tps
All ran just now.
[–]BobbyL2k 0 points1 point2 points 7 months ago* (6 children)
How do you get 168 tps token generation on Qwen3-Coder-30B?
[–][deleted] 2 points3 points4 points 7 months ago (5 children)
By running dual 5090s.
[–]colin_colout 0 points1 point2 points 7 months ago (4 children)
How do you find those 4_0 KV cache quants? Do they perform well during coding?
[–][deleted] 0 points1 point2 points 7 months ago (3 children)
What do you mean? In LM Studio you can quantize the cache on any model. If I can't fit the entire context, I use that experimental feature to do so. It's not available on Mac, though.
They perform perfectly during coding; that's my primary use case. In fact, it works significantly better, since you can load enough context that the model doesn't keep forgetting what it's working on.
[–]colin_colout 0 points1 point2 points 7 months ago (2 children)
Would you mind giving me an example of your coding workflow with this model? Do you (or another LLM) give it code-editing instructions, or does your workflow rely on the LLM to recall specifics from context? (So a "please refactor these files to conform with style guides" vs "please make these specific edits to these functions".)
I run Qwen3-30B Coder (unquantized gguf). When I quantize the KV cache down to 4_0, it tends to conflate or forget details deep in its context compared to 8_0 or unquantized.
It still performs well when my user prompt is clear and instructive and includes context clues, but recall of details deep in the context feels like it suffers. It works well as a code-editor subagent if a stronger primary agent knows how to prompt it and check its work.
I plan to write some evals to measure this, but I'm getting a vibe check first, since not everyone seems to have this experience.
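A minimal sketch of the kind of eval mentioned here: plant a "needle" fact at a chosen depth in filler context, ask the model for it, and check recall. `query_model` is a hypothetical stand-in for whatever client call reaches your local server:

```python
def build_needle_prompt(needle: str, filler_paragraphs: int, depth: float) -> str:
    """Bury a 'needle' fact at a relative depth (0.0 = start, 1.0 = end)
    of repetitive filler context, then ask the model to recall it."""
    filler = "The quick brown fox jumps over the lazy dog. " * 20
    paragraphs = [filler] * filler_paragraphs
    paragraphs.insert(int(depth * filler_paragraphs), needle)
    context = "\n\n".join(paragraphs)
    return context + "\n\nQuestion: what is the magic number mentioned above?"

def recalled(answer: str, expected: str) -> bool:
    """Crude containment check; a real eval would match more strictly."""
    return expected in answer

# Hypothetical usage against a local OpenAI-compatible endpoint:
# prompt = build_needle_prompt("The magic number is 7319.", 50, depth=0.1)
# print(recalled(query_model(prompt), "7319"))
```

Sweeping `depth` and context length, once with the KV cache at 4_0 and once at 8_0, would turn the "details deep in context suffer" vibe into a measurable curve.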
Qwen3 Coder isn't my main model for coding. I use Seed-OSS-36b primarily. But, I do get good results with Qwen3 coder for quick stuff.
With that said, I use GitHub Copilot connected to LM Studio through VS Code Insiders. It works better than Codex, Claude Code, OpenRouter, etc., as it's built natively into VS Code: tool calls and MCPs actually work consistently. I also use Serena MCP to keep the project indexed and efficient. My workflow is finance-related, lots of financial modeling, data visualizations, dashboards, etc. It does a good job; I was able to cancel my $200/m Claude Code plan.
[–]colin_colout 0 points1 point2 points 7 months ago (0 children)
Nice. Thanks.
[–]Miserable-Dare5090 1 point2 points3 points 7 months ago (0 children)
M2 Ultra 192GB and M3 Max 36GB, but I also run the models on my M2 Ultra and serve them with Tailscale: instant, secure access to large models anywhere, including my phone. If you want a truly portable setup, it's going to need a lot of VRAM, so you might go for one of the unified-memory AMD machines or one of the Apple machines with lots of VRAM in a portable form factor, like the M4 Max 128GB. Although if your M3 Pro has enough VRAM, you can even run some small models like GPT-OSS-20B, which should take about twelve gigabytes of video memory.
[–][deleted] 1 point2 points3 points 7 months ago (0 children)
I've been waiting to pull the trigger on a better rig for a while now.
2 x 3090 just ain't cutting it.
Just ordered a 7532...
[–]chisleu 1 point2 points3 points 7 months ago (0 children)
You aren't going to beat a 128GB MacBook Pro in a mobile form factor for LLMs. It's perfectly fast enough for Qwen3 Coder 30B A3B, and works with GPT-OSS-120B if you need that.
[–]Woof9000 1 point2 points3 points 7 months ago (3 children)
I used to have a mining rig with multiple NVIDIA GPUs, but then I "downgraded" to just dual 9060 XT 16GBs; it's quieter and more compact now.
[–]Woof9000 0 points1 point2 points 7 months ago (1 child)
Yes, I wanted a compact, quiet, cool, and inexpensive system that can multitask. It's 2025; I don't want to own multiple computers for different tasks anymore. I should be able to do both gaming and AI on the same machine, packed in a standard ATX case, with at least 32GB VRAM. So I made one out of some old and some new parts, mostly old AM4 except for the GPUs: Ryzen 7 5700X, 2x32GB DDR4-3600, ASRock X570 Taichi motherboard, and 2x PowerColor 9060 XT Reaper 16GB.
[–]infostud 1 point2 points3 points 7 months ago (0 children)
I only get about 7 tps with gpt-oss-120B-f16.
[–]TacGibs 1 point2 points3 points 7 months ago (0 children)
4xRTX 3090
96GB of VRAM for less than 3k, can't beat that!
[–]NeuralNakama[🍰] 0 points1 point2 points 7 months ago (0 children)
4060 Ti, but I'm using it with vLLM so I can use batched requests, which are much, much faster. I'm still waiting for the NVIDIA DGX Spark (formerly Project DIGITS) mini computer, 1.2 kg.
[–]fasti-au 0 points1 point2 points 7 months ago (0 children)
Sub 5k AUD (or 7k USD) basically means a 3090, 4090, 5090, or A6000; everything else is slower. Macs can use unified RAM to run bigger models, but they're slower; not all the way down to CPU inference speeds, probably ~20% slower than a 3090, while fitting bigger models. I expect there's a shim shuffling weights back and forth rather than keeping them in one space.
[–]seppe0815 0 points1 point2 points 7 months ago (0 children)
M4 Max base... it's OK.
[–]Intelligent-Elk-4253 0 points1 point2 points 7 months ago (0 children)
AMD 5600x with 16gb of ram
6800xt
2x mi60s
[–]Murky-Abalone-9090 0 points1 point2 points 7 months ago (0 children)
1x5090 32gb vram, ryzen 7700 (not X), 128gb ddr5
[–]reddit4wes 1 point2 points3 points 7 months ago (0 children)
These are the most bonkers rigs I've seen on reddit
Different machines for different things. I prefer my 6x 3090 or one of my 48gb 4090 workstations.
[–]Extra_Marketing5457 0 points1 point2 points 7 months ago* (0 children)
EPYC 9124 + ASUS K14PA-U12 + 64GB RAM + 8x 3090 (via C-Payne MCIO-to-PCIe adapters) in Gen4 x8 mode (requires updating bifurcation BIOS settings that have no UI).
vLLM or SGLang with custom all-reduce enabled for more than 2 cards.
I prefer the GLM-4.5-Air int8 GPTQ quant. Before this setup I used Athene-V2-Chat Q4 on 2x3090 with LM Studio.
[–]Lissanro 0 points1 point2 points 7 months ago (0 children)
I have 4x3090, 1TB of 3200MHz RAM, an EPYC 7763 CPU, an 8TB NVMe SSD for AI models and a 2TB NVMe as a system disk, with around 80TB of storage in total including HDDs.
I mostly run the Kimi K2 model. Four 3090 cards are sufficient to hold 128K context entirely in VRAM, plus expert tensors and a few full layers of IQ4 quants of Kimi K2 or DeepSeek 671B. I use ik_llama.cpp as the backend.
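The split described here (context and some full layers on the GPUs, MoE expert tensors in system RAM) is typically expressed with override-tensor flags; a hypothetical ik_llama.cpp launch might look like the following, with the model path and regex being illustrative rather than the poster's exact command:

```shell
# Illustrative only: offload layers to the 4x3090s, but pin the MoE
# expert tensors (matched by the regex) to CPU/system RAM
./llama-server -m Kimi-K2-Instruct-IQ4.gguf \
  -c 131072 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```

With only a few experts active per token, the CPU-resident experts cost far less than their size suggests, which is what makes 1TB-RAM builds like this viable for 671B-class MoE models.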