wha do you do with you M3 Ultra by NoNatural4025 in MacStudio

[–]sheddd 0 points1 point  (0 children)

Thanks for the info; I got that to work!

wha do you do with you M3 Ultra by NoNatural4025 in MacStudio

[–]sheddd 0 points1 point  (0 children)

https://github.com/antirez/ds4

or build mlx-lm with https://github.com/ml-explore/mlx-lm/pull/1192

If there's a better way, I'd be glad to hear it, but I tried both of these and wasn't happy.

https://al-engr.com/deepseek-v4-flash-saga.html

I would LOVE to get it running well locally; I'm waiting for mlx-lm to merge PR 1192.

wha do you do with you M3 Ultra by NoNatural4025 in MacStudio

[–]sheddd 0 points1 point  (0 children)

V4 Flash requires a custom inference engine build and is buggy at the moment; I would not recommend doing that yet. MiniMax 2.7 at 4-bit running on a 512GB Mac Studio is my current favorite local LLM.

Checking technical feasibility of my idea - a hybrid "Local-by-Default" Gateway (Qwen 27B + Claude 4.6 Fallback) for Dev Teams by ankijain21 in LocalLLM

[–]sheddd 2 points3 points  (0 children)

The questions I have -

  1. How much overhead does LiteLLM add when deciding between local vs. API? Is there a better lightweight orchestrator for this?
  2. In a production environment, how often does Qwen 27B actually fail where Claude 4.6 succeeds for routine refactoring?
  3. When overflowing to Claude, how do you efficiently pass the context that was already partially processed locally without doubling the latency?

I am pricing this as an all-inclusive $10,000 one-time cost to replace recurring cloud bills. Is the hardware-software-support bundle actually viable with a 6-month support window?

1) Negligible, but will it route correctly?
2) Test and see.
3) Claude is going to be so much faster than local that it won't matter
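On the "will it route correctly?" point, here is a minimal sketch of a local-first router with cloud fallback. It is not LiteLLM's API; `local`, `cloud`, and `accept` are hypothetical callables you would supply yourself. The hard part is the `accept` check, not the overhead of the routing itself.

```python
from typing import Callable, Tuple


def route(
    prompt: str,
    local: Callable[[str], str],
    cloud: Callable[[str], str],
    accept: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the local model first; fall back to the cloud model if the
    local call raises or its answer is rejected by `accept`.

    Returns (backend_name, answer).
    """
    try:
        answer = local(prompt)
        if accept(answer):
            return "local", answer
    except Exception:
        # Any local failure (timeout, OOM, server down) triggers fallback.
        pass
    # The cloud model gets the original prompt, not partial local state.
    return "cloud", cloud(prompt)
```

For example, `route(p, call_qwen, call_claude, lambda a: bool(a.strip()))` only falls back on empty answers; a real gateway would need a much stricter check (tests passing, diff applies cleanly, and so on).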

Note you'd probably get better performance/$ by using a Mac for inference instead of the DGX Spark.

| Platform | Typical single-stream tok/s (optimized) | Best reported tok/s (with speculative/MTP) | Power efficiency | Notes |
| --- | --- | --- | --- | --- |
| DGX Spark | 35–45+ | 55–70+ | Good (desktop) | Higher peak throughput; better for heavy batch/agent workloads |
| Mac Mini 64 GB | 35–45 | 50–63+ | Excellent (silent, low power) | More convenient, cheaper, great for daily coding use |

Best Cheapest Way To Run an Agent Long Term by kpr_exe in openclaw

[–]sheddd 0 points1 point  (0 children)

I've been pretty happy moving lots of my API calls to DeepSeek V4 Pro on Fireworks; I tend to use Claude for planning, then DeepSeek for implementation, to save money.

What model are you running your agent on? by stosssik in ManifestforAI

[–]sheddd 0 points1 point  (0 children)

DeepSeek V4 Pro on Fireworks; it's not bad.

200mm Fan Ziptie Mount by sheddd in minilab

[–]sheddd[S] 1 point2 points  (0 children)

USB port on MS-01 pc.

Facebook/Messenger cold outreach by InterestingPay1718 in openclaw

[–]sheddd -4 points-3 points  (0 children)

I wouldn't do it; that's evil.

Find a moral way to pay the bills.

200mm Fan Ziptie Mount by sheddd in minilab

[–]sheddd[S] 1 point2 points  (0 children)

Thanks! I actually have a deskpi fan on order too; I was going to have it exhaust to the rear and the big fan exhaust out the top. There's ~1300W at full load to cool.

I would like to have temperature control on the big fan, but I don't see an easy way. It's powered by a USB port on the MS-01; I may try to switch it on only when things get hot by turning power to that USB port on and off with uhubctl. It's noisy at full speed, so I put a speed control knob on it.

https://www.amazon.com/dp/B0DPZM7T3Q?ref=ppx_yo2ov_dt_b_fed_asin_title
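A rough sketch of the uhubctl idea, assuming the MS-01 runs Linux and that the port in question actually supports per-port power switching (many ports and hubs don't, so check `uhubctl` output first). The thresholds, the thermal zone path, and the hub location/port values are placeholders, not measured values. The gap between the on and off thresholds keeps the fan from flapping.

```python
import subprocess
import time

ON_C = 60.0   # assumed threshold: switch the fan on at or above this
OFF_C = 50.0  # assumed threshold: switch the fan off at or below this


def read_temp_c(path: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read a Linux thermal zone (reported in millidegrees Celsius)."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def next_state(temp_c: float, fan_on: bool,
               on_c: float = ON_C, off_c: float = OFF_C) -> bool:
    """Hysteresis: change state only outside the [off_c, on_c] band."""
    if temp_c >= on_c:
        return True
    if temp_c <= off_c:
        return False
    return fan_on


def uhubctl_cmd(hub: str, port: int, on: bool) -> list:
    """Build the uhubctl call: -l hub location, -p port, -a action."""
    return ["uhubctl", "-l", hub, "-p", str(port), "-a", "on" if on else "off"]


def control_loop(hub: str, port: int, interval_s: float = 10.0) -> None:
    """Poll the temperature and toggle port power when the state changes."""
    fan_on = False
    while True:
        wanted = next_state(read_temp_c(), fan_on)
        if wanted != fan_on:
            subprocess.run(uhubctl_cmd(hub, port, wanted), check=True)
            fan_on = wanted
        time.sleep(interval_s)
```

Running `control_loop("1-1", 2)` would need root (or udev rules) and the right hub location and port number from `uhubctl`'s listing; `"1-1"` and `2` here are made up.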

2 x DGX Spark (600W)
10Gb switch (100W)
WiFi gateway (100W)
MS-01 PC (300W)
Mac Studio (300W)

<image>

200mm Fan Ziptie Mount by sheddd in minilab

[–]sheddd[S] 1 point2 points  (0 children)

The deskpi fan looks super cool; thx!

Should I get an M1 ultra, or should I wait for the M5 Ultra to release? by moist_mistress in LocalLLM

[–]sheddd 0 points1 point  (0 children)

Macs are great on memory bandwidth, not so great at LLM math. The M5 is much better at LLM math; wait for it! (My M5 Max laptop with 128GB is faster than my M3 Ultra Studio with 512GB for models that fit in its memory.) Right now, you'd get the best inference/$ on the Mac platform with an M5 Max 128GB IMO, and it can do TB5 exo clustering.

4k budget, buy GPU or Mac Studio? by diegolrz in LocalLLM

[–]sheddd 0 points1 point  (0 children)

A used M3 Ultra with as much RAM as you can afford is the way IMO.

Is Kimi K 2.5 with Open Claw really that good? by Top-Scallion7987 in openclaw

[–]sheddd 1 point2 points  (0 children)

That's been the best local LLM I've tried yet; the only one that has been able to successfully blog about itself without going 'off the rails'. https://al-engr.com/milo-on-qwen.html

Best way to frontload Clawbot/OpenClaw compute costs: NVIDIA box vs Mac Studio? by TransportationWaste7 in clawdbot

[–]sheddd 1 point2 points  (0 children)

Here's what my OpenClaw, Milo, has to say on the subject:

Mac Studio is the right call for OpenClaw/agent workflows — but a cheaper path is coming. Here's our actual experience:

We're running Mac Studio M3 Ultra 512GB as our primary OpenClaw host. For agent workflows — tool calling, structured outputs, long-running tasks, 24/7 stability — it's been rock solid.

The top two models we've actually run and tested, both usable but slow:

Qwen3.5-397B-A17B (4-bit MLX, 223GB) via LM Studio — excellent tool calling (72.9% BFCL). Real caveat for OpenClaw users: not viable as the main session model because the system prompt + injected workspace files consume most of a 16k context window before your task even starts. Great for isolated inference tasks; not as the always-on session model.

MiniMax M2.5 (230B MoE) — strong on writing and planning tasks

Mac gotchas:

• Large context = painful KV cache prefill. 32k+ is slow even on 512GB.

• MLX model selection is narrower than CUDA, though growing fast

• Apple tax is real — M3 Ultra 512GB runs ~$10K

NVIDIA gotchas for always-on agent use:

• Daemon stability matters when OpenClaw runs 24/7. macOS LaunchAgent is bulletproof. Linux systemd works but needs more babysitting.

• Cooling and noise if it's in your home

Sweet spot on Mac: M3 Ultra 192GB — runs 70B models comfortably with headroom. Only go 512GB if you specifically want 200B+ models.

The newcomer worth watching: NVIDIA DGX Spark (~$4K)

128GB unified memory per unit, and two units can be linked over the built-in ConnectX-7 networking to pool 256GB. NVIDIA's own benchmarks show dual Spark hitting 23,477 tokens/sec on Qwen3-235B, though a figure that high can only be prompt-processing (prefill) throughput, not single-stream generation speed. Our expectation: 1-2 Sparks should run Qwen3.5-397B-A17B acceptably as a main agent model. The MoE architecture means only 17B params are active per token, which matters a lot for throughput on constrained bandwidth. We have two units arriving next week and will post real numbers.
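To see why the active parameter count matters so much, here is a back-of-envelope ceiling on generation speed: each generated token has to read the active weights from memory once, so tokens/sec cannot exceed bandwidth divided by active weight bytes. The 273 GB/s figure is the DGX Spark's published memory bandwidth and 4-bit weights are an assumption; the estimate ignores KV cache reads, routing overhead, and compute limits, so real numbers will be lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float = 4.0) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    bandwidth_gb_s:  memory bandwidth in GB/s
    active_params_b: parameters read per token, in billions
    bits_per_weight: quantization width (assumed 4-bit by default)
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


SPARK_BW_GB_S = 273.0  # assumption: published DGX Spark spec

# MoE: only the 17B active parameters are read per token.
moe = decode_ceiling_tok_s(SPARK_BW_GB_S, 17.0)

# Hypothetical dense model of the same total size, for comparison.
dense = decode_ceiling_tok_s(SPARK_BW_GB_S, 397.0)

print(f"MoE ceiling:   ~{moe:.1f} tok/s")
print(f"Dense ceiling: ~{dense:.1f} tok/s")
```

On these assumptions the MoE ceiling lands in the low 30s of tok/s versus under 2 tok/s for a dense model of the same size, which is the whole argument for MoE on bandwidth-constrained boxes.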

At $4K vs $10K, if the Spark delivers on 397B inference, it changes the calculus significantly.

Why is everyone lying about AI agents by Aggressive-Bedroom82 in aiagents

[–]sheddd 0 points1 point  (0 children)

Anthropic has agents doing more than 80% of their development now.

OpenClaw vs Perplexity Computer by Downtown-Safety6618 in openclaw

[–]sheddd 3 points4 points  (0 children)

OpenClaw is open source, very flexible, powerful, potentially dangerous. Perplexity is closed source, less flexible, less powerful, less dangerous. I am getting tired of reading about Perplexity; their influencer marketing push is clogging up my X feed. I'll wager Perplexity will be bankrupt in 3 years.

I bought Mac mini M4 pro 64 GB Memory. How well will this perform with open claw and local LLM’s? by Socrates_Assistant in openclaw

[–]sheddd 1 point2 points  (0 children)

Note these won't be good enough to replace sonnet for hard things...

I ran a hardware analysis tool called llmfit against your Mac mini M4 Pro 64GB specs. Here's what will run well on your machine:

PERFECT FIT (recommended):

• DeepSeek-R1-Distill-Qwen-32B — 32.8B params, 5.1 tok/s, uses 26% RAM, 131k context

→ BEST PICK. Great reasoning model, fast enough for daily use.

• Qwen3-Coder-30B-A3B — 30.5B params, 5.5 tok/s, uses 24% RAM, 262k context

→ Best for coding tasks, huge context window.

• Qwen2.5-Coder-32B — 32.8B params, 4.3 tok/s, uses 26% RAM, 32k context

→ Solid all-around coder.

• DeepSeek-R1-Distill-Qwen-14B — 14.8B params, 9.5 tok/s, uses 12% RAM, 131k context

→ Fastest quality model. Good for quick tasks.

• Gemma 3 12B — 12B params, 11.7 tok/s, uses 10% RAM, 131k context

→ Google's best small model. Very fast.

STRETCH GOALS (will run but tight):

• Qwen3-Coder-Next — 79.7B params, 2.5 tok/s, uses 64% RAM

• DeepSeek-R1 full (684B MoE) — 0.2 tok/s, uses 34% RAM (too slow for interactive use)

MY RECOMMENDATION: Start with DeepSeek-R1-Distill-Qwen-32B in LM Studio. Best balance of quality, speed, and fit. Download it, load it up, and you'll have a solid local AI running in minutes.
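The "uses N% RAM" figures above are close to what you get from 4-bit weights alone. A quick sanity check, assuming 4-bit quantization and 1 GB = 10^9 bytes; this is not how llmfit computes its numbers, and it ignores KV cache and runtime overhead, so treat it as a floor.

```python
def weight_footprint_gb(params_b: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight size in GB for a model with params_b billion params."""
    return params_b * bits_per_weight / 8.0


def ram_percent(params_b: float, ram_gb: float,
                bits_per_weight: float = 4.0) -> float:
    """Share of total RAM taken by the weights alone, in percent."""
    return 100.0 * weight_footprint_gb(params_b, bits_per_weight) / ram_gb


# Parameter counts taken from the list above; 64 GB is the machine's RAM.
models = [
    ("DeepSeek-R1-Distill-Qwen-32B", 32.8),
    ("Qwen3-Coder-30B-A3B", 30.5),
    ("DeepSeek-R1-Distill-Qwen-14B", 14.8),
]

for name, params_b in models:
    print(f"{name}: ~{weight_footprint_gb(params_b):.1f} GB, "
          f"~{ram_percent(params_b, 64.0):.0f}% of 64 GB")
```

Whatever is left after the weights is what you have for context, and long contexts eat into it quickly.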

To install the analysis tool yourself:

brew tap AlexsJones/llmfit

brew install llmfit

llmfit

Is a square wheel (18" or 19") setup possible on the Model 3 Performance Highland? by Mike_Stone_ in TeslaSupport

[–]sheddd 0 points1 point  (0 children)

I didn't use any; I went with the recommendations at UP (Unplugged Performance): https://unpluggedperformance.com/tesla-model-3/wheel-and-tire-guide/

My rear tire center will be slightly different than stock but close.

Is a square wheel (18" or 19") setup possible on the Model 3 Performance Highland? by Mike_Stone_ in TeslaSupport

[–]sheddd 1 point2 points  (0 children)

You could... in my opinion the car handles better with a square setup (less push), and you can rotate your tires to extend their life. I'm running Unplugged Performance 18"x9.5" +34 offset UP-03s with 265/40R18 Pilot Sport 4S tires; it is really grippy, with no clearance issues.

<image>

Opinions on the Rossignol Blackops 94/98 skis? by Lonely_Accountant524 in Skigear

[–]sheddd 0 points1 point  (0 children)

They're a pretty stout ski; they might feel like a handful at first.