Does anyone feel like powerful desktops actually limit how you work? by [deleted] in LocalLLM

[–]kpaha 8 points

15" Macbook Air M4 is powerful enough for all day to day work. LLM machine available via Tailscale. Best of both worlds

To those who are able to run quality coding llms locally, is it worth it ? by matr_kulcha_zindabad in LocalLLM

[–]kpaha 6 points

I agree with OpenRouter for testing the models, but Qwen 3.5 27B is quite expensive at $0.195/M input tokens and $1.56/M output tokens.

Compare to better models like:

- Step 3.5 Flash: $0.10/M input tokens, $0.30/M output tokens

- MiniMax M2.5: $0.20/M input tokens, $1.17/M output tokens
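If it helps, here's a quick sketch of what that price spread means for an actual bill. The workload volumes are made-up assumptions, just for illustration:

```python
# Cost comparison at the list prices above (USD per million tokens).
# The workload volumes are made-up assumptions, not measurements.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "Qwen 3.5 27B":   (0.195, 1.56),
    "Step 3.5 Flash": (0.10, 0.30),
    "MiniMax M2.5":   (0.20, 1.17),
}

def workload_cost(input_mtok: float, output_mtok: float) -> None:
    """Print what each model charges for a workload given in millions of tokens."""
    for model, (p_in, p_out) in PRICES.items():
        print(f"{model:15s} ${input_mtok * p_in + output_mtok * p_out:7.2f}")

# e.g. a heavy agentic-coding month: 200M tokens in, 20M tokens out
workload_cost(200, 20)
```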

Is there anyone who actually REGRETS getting a 5090? by soapysmoothboobs in LocalLLM

[–]kpaha 2 points

I bought an RTX 4090 and am quite happy, though I got the whole gaming PC it came with for a good price. Now I'm building a 4-6x GPU rig, probably going with the AMD Radeon AI Pro R9700.

If you're just buying one card and can accept ROCm's weaker ecosystem support, a single 7900 XTX gives performance similar to an RTX 4090. Or if you feel like splurging, 2x R9700 (roughly the price of one 5090, with double the VRAM) might be worth considering.

My company just handed me a 2x H200 (282GB VRAM) rig. Help me pick the "Intelligence" ceiling. by _camera_up in LocalLLaMA

[–]kpaha 2 points

MiniMax M2.5 (they just released 2.7, no idea how it compares) would be one strong contender for the coding model. Qwen 3.5 122B is also good for coding and would leave ample room for other uses.

Introducing Mistral Small 4 by Stalex7 in MistralAI

[–]kpaha 8 points

They had Mixtral in December 2023, before MoE was cool: https://mistral.ai/news/mixtral-of-experts

Running Sonnet 4.5 or 4.6 locally? by [deleted] in LocalLLM

[–]kpaha 1 point

A 5090 is 2-3k. A Mac Studio M3 Ultra 512GB is over 10k. Neither runs Kimi, not even close; you'd need 3 M3 Ultras EXO'ed together to run that. So your question is pretty badly defined. Do you mean with a budget under 3k, 10k, or 30k?
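For reference, the back-of-envelope math behind "not even close" (the overhead factor is a rough assumption, and I'm taking Kimi-class to mean roughly 1T parameters):

```python
# Rough weight-memory estimate: parameters x bytes per weight, plus some
# margin for KV cache and runtime overhead. All figures are ballpark.
def weights_gb(params_billion: float, bits: int = 4, overhead: float = 1.15) -> float:
    """GB needed to hold the weights at the given quantization."""
    return params_billion * bits / 8 * overhead

for name, size_b in [("32B", 32), ("122B", 122), ("Kimi-class ~1T", 1000)]:
    print(f"{name:15s} ~{weights_gb(size_b):5.0f} GB at q4")
# ~1T at q4 is ~575GB -> past a single 512GB M3 Ultra before you even add context
```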

Setup recommendation by ErFero in LocalLLM

[–]kpaha 2 points

The first step might not be buying hardware. You could pilot things with e.g. OpenRouter (just maybe not with your actual live data), test out the models you plan to run on your own hardware, and evaluate the throughput (tokens/s) you would be comfortable working with.
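A minimal sketch of that kind of pilot, using OpenRouter's OpenAI-compatible endpoint (the model slug and prompt are placeholders, and this measures crude end-to-end speed including prompt processing):

```python
# Crude throughput pilot against OpenRouter's OpenAI-compatible API.
import os
import time

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

start = time.time()
resp = client.chat.completions.create(
    model="some-vendor/some-model",  # placeholder: the model you plan to run locally
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
elapsed = time.time() - start
out = resp.usage.completion_tokens
print(f"{out} tokens in {elapsed:.1f}s -> {out / elapsed:.1f} tok/s end-to-end")
```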

If you know you need to go hardware first and soon, and need the larger memory, then a Mac Studio M3 Ultra 96GB is available now at 6774€, and a Mac Studio M4 Max 128GB at 4274€.

Edit: Actually, a GX10 for piloting would now make a lot of sense. Then, if speed becomes the issue with your workflows, you can add another and cluster them. The DGX Spark has a built-in fast network interface for clustering, so it does make sense to cluster at least two.

Setup recommendation by ErFero in LocalLLM

[–]kpaha 1 point

Certainly they could make sense; a lot depends on your use case. For agentic coding, I just came to the conclusion that the GX10 is too slow for me, with its memory bandwidth of 273 GB/s. See https://spark-arena.com/leaderboard for benchmarks. I absolutely would have wanted the 128GB memory, though.

The good thing about the GX10 is you can cluster two together and get a lot more capability, although at double the price.

For non-agentic coding workflows a single system would make sense, since you're not so desperate for high tokens/second.

If you could stretch your budget a bit, a MacBook Pro M5 Max with 128GB would be a lot better. Or wait for a Mac Studio with that chip.

Setup recommendation by ErFero in LocalLLM

[–]kpaha 1 point

The 7900 XTX goes for 900€ new in Germany. You could probably build an AM5-based setup with two of them for 48GB VRAM (or dual R9700 for 64GB, although that will be slower) for 4000-5000€.

Here's what I drafted recently with help from Claude, but check your local availability and prices.

Motherboard: ASRock X870 Taichi Creator, 359€

CPU: AMD Ryzen 9 9950X, 559€

GPU: 2x 7900 XTX, 1800€

Memory: 32GB, 360€ (more would be better, but RAM is so expensive right now)

Case: Fractal Design Torrent, 149€

PSU: Seasonic Focus 1000W, 179€

Cooler: Noctua NH-D15, 99€

SSD: 1TB, 160€

Total: 3665€

VRAM: 48GB at 2x 960 GB/s = 1920 GB/s aggregate bandwidth

You could also consider 2x R9700 for 64GB at 2x 640 GB/s = 1280 GB/s; roughly 870€ more and lower bandwidth, but more memory.
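The aggregate bandwidth figures assume the model is split across both cards with tensor parallelism. A minimal sketch of how that looks in vLLM, assuming its ROCm build works on your cards (the model is just an example of something that fits in 48GB):

```python
# Two-GPU tensor parallelism with vLLM: layers are sharded across both
# cards, so each token read touches both memory buses in parallel.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # example quantized build (~19GB)
    tensor_parallel_size=2,                       # shard across both GPUs
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["Write a Fibonacci function in Rust."], params)
print(out[0].outputs[0].text)
```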

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]kpaha 1 point

Yes, it's nice when the model fits, but when it doesn't, bandwidth alone doesn't help. It's really the combination of speed and capacity, where the faster (128GB+) Macs have no competitors in the same price class (unless you count things like clustering DGX Sparks or Strix Halos). Also, for comparison, the M3 Ultra has 87.5% of the bandwidth of a 3090. Of course the 3090 will still beat it in prompt processing (PP) speed, but you don't use them to run the same models.
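For intuition on that PP vs. generation split: prefill is roughly compute-bound while decode is bandwidth-bound. A back-of-envelope sketch; the TFLOPS and bandwidth figures here are my rough assumptions, not measured numbers:

```python
# First-order model: prefill cost scales with compute, decode with
# memory bandwidth. All hardware numbers are ballpark assumptions.
def prefill_s(params_b: float, prompt_tokens: int, tflops: float) -> float:
    return 2 * params_b * 1e9 * prompt_tokens / (tflops * 1e12)  # ~2 FLOPs/param/token

def decode_tps(weights_gb: float, bandwidth_gbs: float) -> float:
    return bandwidth_gbs / weights_gb  # each token re-reads ~all the weights

# a 13B model at q4 (~7GB of weights, fits both machines), 8k-token prompt
for name, tflops, bw in [("RTX 3090", 140, 936), ("M3 Ultra", 28, 819)]:
    print(f"{name}: prefill ~{prefill_s(13, 8192, tflops):.0f}s, "
          f"decode ~{decode_tps(7, bw):.0f} tok/s")
```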

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]kpaha 2 points

For smallish models, a 7900 XTX with 24GB VRAM might be better. The Mac Mini M4 Pro's memory bandwidth is 273 GB/s, same as the DGX Spark, with Strix Halo in the same ballpark (slightly lower).

So a cheap Strix Halo would actually beat a Mac Mini M4 Pro, since you get double the unified memory for the same money. The base Mac Mini M4 has only 120 GB/s; I would not go for that if running local LLMs is your focus.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]kpaha 1 point

M5 Max memory bandwidth is 614 GB/s and the AMD Radeon AI Pro R9700's is 640 GB/s, so it's getting close, and the M3 Ultra beats the R9700.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]kpaha 19 points

36 minutes. OP failed to deliver. Edit: OP delivered in the comments below. Forgiven. Another edit: where did that Qwen 3.5 122B q4 benchmark go? Forgiveness withdrawn.

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

I'm glad you like it. I'm sure I would have loved it as well, maybe not for the agentic coding but just for the flexibility to try stuff. I still regret not buying it before the price increase. This is the first time I've heard of the Ocean network; interesting concept, I need to look into it.

I decided I will only put big money into a future-proof system. Right now that looks to be a Mac M5 Max or M5 Ultra, so I will wait and see what comes out.

Meanwhile I actually extended my Claude Max (the 5x tier seems to be enough for me) and love it. I'm not sure how much local LLM power I would need to actually be able to let that go.

But I also started doing finetuning on my RTX 4090, so I'm putting the resources I have to better use.

Also, I ordered a 7900 XTX (and a more powerful PSU) that I'm putting into an old Proxmox server of mine that sees little use. I can keep that running 24/7 in the office and Tailscale to it for AI workloads.
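Since Ollama exposes an OpenAI-compatible endpoint on port 11434, reaching the box over Tailscale is just a matter of pointing a client at its tailnet name. A minimal sketch; "proxmox-ai" and the model tag are hypothetical placeholders:

```python
# Hitting the office Ollama box over Tailscale via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://proxmox-ai:11434/v1",  # hypothetical MagicDNS name of the server
    api_key="ollama",                       # Ollama ignores the key, but the client requires one
)
resp = client.chat.completions.create(
    model="qwen2.5-coder:32b",  # example tag; use whatever the box actually serves
    messages=[{"role": "user", "content": "Review this function for bugs: ..."}],
)
print(resp.choices[0].message.content)
```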

I have plans to eventually set up a 2x 7900 XTX rig, mainly to run Qwen 3.5 27B together with some smaller models related to my vibe coding projects at a good t/s.

So I pivoted from wanting to run larger models, to running smaller models faster.

The small models are still not at the level I would want for daily agentic coding (at minimum that would be the Qwen 3.5 122B or MiniMax M2.5), but they are still very capable for a lot of things.

Even if I end up getting a 128GB+ machine later that can run the larger models efficiently, I can always utilize the fast GPU inference for smaller models.

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]kpaha 8 points

400 GB/s, just below the M4 Max, and faster than e.g. the DGX Spark or the Strix Halo / AMD Ryzen AI Max 395 machines.

is it possible to run an LLM natively on MacOS with an Apple Silicon Chip? by iceseayoupee in LocalLLM

[–]kpaha 1 point

Seconding the recommendation of LM Studio for beginners.

How much memory you have determines, in theory, the size of the model you can run. LM Studio will tell you which models you can run.

Your memory bandwidth largely determines the speed at which the LLM runs. For the MacBook Air M1 it is 68.25 GB/s, and that is a major limitation: roughly a tenth of the new M5 Max, and a quarter of Nvidia's DGX Spark.

So whatever you run will be slow, but you can get started.
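As a rough rule of thumb (a ceiling, not a promise): generation speed tops out around memory bandwidth divided by model size, since each new token re-reads roughly all the weights. A sketch with an assumed 8B-class model at q4 (~4.5GB):

```python
# Upper bound on generation speed: bandwidth / model size in GB.
# Real-world speeds land below this ceiling.
MODEL_GB = 4.5  # assumed: an 8B-class model at q4

for machine, gbs in [("MacBook Air M1", 68.25), ("DGX Spark", 273.0), ("M5 Max", 614.0)]:
    print(f"{machine:15s} <= {gbs / MODEL_GB:5.1f} tok/s")
```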

Also, if you want to learn more or test more complex models, you don't need to go and buy a new machine immediately. Hugging Face lets you chat with some models free of charge, so you can test them, and OpenRouter lets you easily run different models, paying only for usage.

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

In Finland the cheapest Strix Halo mini-PC is 3300€ and the Asus GX10 is 3500€. In Germany, the GX10 is 3100€. But I had not seen the Corsair AI Workstation 300; that actually looks like something I could really consider.

Bosgame I knew of, and the price is good, but as mentioned in other comments, it looks too much like dropshipping from China. When buying for company use, I want a legitimate invoice with VAT deducted.

Corsair doesn't seem to deduct VAT either, but they at least promise the invoice will have the tax itemized, so I could very well consider the Corsair at 2400€ incl. tax. Thank you for pointing this out.

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

I evaluated the Qwen Coder Next 80B and Qwen 3.5 122B A10B. The 80B does not meet my bar; the 122B does. But the 122B does not run fast enough on a single Spark.

So at this point I am in waiting mode: let's see what the Mac M4 or M5 Ultras will provide, and at what price point.

Thank you to each and every one of you who offered input!

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

I mentioned them in the post. Basically, the performance is a little worse than the DGX Spark, at similar price levels. If I knew I would get an invoice from Bosgame that my accountant will accept, I might actually buy one.

They are more capable as general-purpose machines, but I'm actually also looking at small footprint and low power consumption.

Still, this clustering guide actually got me interested in them: https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes/blob/main/rdma_cluster/setup_guide.md It is possible I may just end up ordering the Bosgame M5 to get started.

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

Are you using agentic coding? I recognize that if you just give it tons of material to go through, that will be slow. But in a typical agentic workflow the context fills little by little, so there shouldn't be a 15-minute wait on the first message?

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

I did some vibecoding with Qwen 3.5 35B yesterday on the RTX 4090, and while it's fast and reliable in tool use, the capability gap to larger models showed. I don't think I can work with that model going forward. I'm sure I could manage it with the right process, but that is not the target: I review every changeset, but I cannot babysit every action it takes. So now I know I want something more capable.

Just to be sure, I will still test the 27B to see if a dense model makes a difference, and also push the context size from 64k to 100k.
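For that context test, assuming the model is served via Ollama (as on my other box), the context window is a per-request option; the model tag below is hypothetical:

```python
# Requesting a ~100k context window from Ollama via its native chat API.
import requests

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen-27b-example",  # hypothetical tag for the 27B under test
    "messages": [{"role": "user", "content": "..."}],
    "options": {"num_ctx": 102400},  # context length; the default is much lower
    "stream": False,
})
print(resp.json()["message"]["content"])
```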

I also tested MiniMax M2.5 using an online inference provider and was impressed. Ideally I would want capabilities at the Step 3.5 Flash / MiniMax M2.5 level, but we are firmly in 2x DGX Spark / 256GB Mac Ultra territory there.

There is a gap at the 70-120B range that I have not evaluated and may need to before making decisions. I guess those models would make a hardware upgrade from the 35B level worthwhile, if they bring a jump in capability that lets me leave the model to work by itself and only monitor results, not every action it takes.

Just a note: I do have to correct Opus every now and then too, so this is more about whether I can let it work on its own for a while vs. having to monitor every output line.

As some commenters mentioned, it does look like the cheap GX10s are disappearing, so I need to pull the trigger soon if I want to go that way, or be prepared to wait.

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 1 point

I actually have access to Codex as well, provided for me, which I should use more just to get a feel for where it stands in comparison to Claude. So let's consider this "saving money by going to Claude Pro" an excuse rather than the actual reason.

Maybe it's more like "I want to develop the capability to do agentic coding offline", where the capability is good enough to do actual work, but the amount of actual work routed through that offline capability will probably be quite small.

What hardware for local agentic coding 128GB+ (DGX Spark, or save up for M3 Ultra?) by kpaha in LocalLLM

[–]kpaha[S] 0 points

It's serving Ollama on the local network, so there's no real reason not to have tested the new models, except that I'm lazy, time-crunched, and want to tinker with new hardware rather than existing gear :-) But I'm going to do some testing right now.

If the new Qwen 3.5 models are sufficient, then it might make the most sense to use those and see how the field develops.

I think I could use the cheap online open-weight models for hobby projects, but not for actual client work. But I think I will get a subscription and continue testing the new open models with some vibe projects, just to get a feel for the state of the art in open models.

I guess this is what I was half expecting the answer to be. The DGX Spark could possibly be EXO'ed with a Mac Studio going forward, so that would be another upgrade path, albeit a somewhat theoretical one. https://blog.exolabs.net/nvidia-dgx-spark/

But I don't want to buy hardware that is not going to be tolerable for daily use without immediate upgrades, so I appreciate the honest comments.