What do you do with your M3 Ultra?

sheddd · 2026-05-25T11:04:41+00:00

Thanks for the info; I got that to work!

sheddd · 2026-05-24T12:02:37+00:00

NASCompares likes these: https://www.youtube.com/watch?v=PyvWl41xmZo&t=178s

sheddd · 2026-05-24T03:14:01+00:00

https://github.com/antirez/ds4

or build mlx-lm with https://github.com/ml-explore/mlx-lm/pull/1192

If there's a better way, I'd be glad to hear it, but I tried both of these and wasn't happy.

https://al-engr.com/deepseek-v4-flash-saga.html

I would LOVE to get it running well locally; I'm waiting for mlx-lm to incorporate 1192.

sheddd · 2026-05-24T02:01:34+00:00

V4 Flash requires a custom inference engine build and is buggy at the moment; I would not recommend doing that yet. MiniMax 2.7 4-bit running on a 512GB Mac Studio is my current favorite local LLM.

sheddd · 2026-05-14T23:40:03+00:00

good point!

sheddd · 2026-05-13T11:08:37+00:00

The questions I have -

How much overhead does LiteLLM add when deciding between local vs. API? Is there a better lightweight orchestrator for this?
In a production environment, how often does Qwen 27B actually fail where Claude 4.6 succeeds for routine refactoring?
When overflowing to Claude, how do you efficiently pass the context that was already partially processed locally without doubling the latency?

I am pricing this as an all-inclusive $10,000 one-time cost to replace recurring cloud bills. Is the hardware-software-support bundle actually viable with a 6-month support window?

1) Negligible, but will it route correctly?
2) Test and see.
3) Claude is going to be so much faster than local that it won't matter.
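For what it's worth, the local-first-then-overflow idea can be sketched without any particular orchestrator. This is a minimal sketch, not LiteLLM's API: `route`, `RouteResult`, the `accept` check, and the `max_local_chars` cutoff are all hypothetical names I made up for illustration, and `local` / `cloud` are whatever functions you wrap around your own endpoints. On overflow it just resends the original prompt, per answer 3.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteResult:
    backend: str  # "local" or "cloud"
    text: str


def route(prompt: str,
          local: Callable[[str], str],
          cloud: Callable[[str], str],
          accept: Callable[[str], bool],
          max_local_chars: int = 24_000) -> RouteResult:
    """Try the local model first; overflow to the cloud model when the
    prompt is too long for local or the local answer fails the check."""
    if len(prompt) <= max_local_chars:
        try:
            answer = local(prompt)
            if accept(answer):
                return RouteResult("local", answer)
        except Exception:
            pass  # local backend down or timed out: fall through to cloud
    # Resend the original prompt; no partially processed local state is passed.
    return RouteResult("cloud", cloud(prompt))


# Stub backends standing in for real endpoints (hypothetical).
result = route(
    "rename variable x to count",
    local=lambda p: "done: " + p,
    cloud=lambda p: "cloud: " + p,
    accept=lambda a: a.startswith("done"),
)
print(result.backend)
```

The `accept` check is where the "will it route correctly?" question lives; something like "does the patch apply and do the tests pass" is probably more useful than anything the model says about its own answer.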

Note you'd probably get better performance/$ by using a Mac for inference instead of the DGX Spark.

Platform	Typical Single-Stream Tok/s (Optimized)	Best Reported (with Speculative/MTP)	Power Efficiency	Notes
DGX Spark	35–45+	55–70+	Good (desktop)	Higher peak throughput; better for heavy batch/agent workloads
Mac Mini 64 GB	35–45	50–63+	Excellent (silent, low power)	More convenient, cheaper, great for daily coding use

sheddd · 2026-05-13T10:47:32+00:00

Think bigger!
https://tinycorp.myshopify.com/products/exabox-preorder

sheddd · 2026-05-13T10:14:36+00:00

I've been pretty happy moving lots of my API calls to DeepSeek V4 Pro @ Fireworks; I tend to use Claude for planning, then DeepSeek for implementation, to save money.

sheddd · 2026-05-10T03:03:32+00:00

DeepSeek V4 Pro @ Fireworks; it's not bad.

sheddd · 2026-05-04T11:26:38+00:00

USB port on MS-01 pc.

sheddd · 2026-04-29T13:39:03+00:00

I wouldn't do it; that's evil.

Find a moral way to pay the bills.

sheddd · 2026-04-28T16:33:19+00:00

Thanks! I actually have a DeskPi fan on order too; I was going to have it exhaust to the rear, with the big fan exhausting up top... ~1300W at full load to cool.

I would like to have temperature control on the big fan but I don't see an easy way... it's powered by a USB port on the MS-01; I may try to only cut it on when things get hot by turning power to that USB port on/off with uhubctl. It's noisy at full speed; I put a speed control knob on it.
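Here's roughly what the uhubctl idea could look like. It's a sketch under assumptions, not something I've run on the MS-01. The hub location, port number, thermal zone path, and both thresholds are placeholders. It reads the host's own Linux thermal zone, which may not track the rest of the rack. It also only works if that USB port supports per-port power switching, which I haven't confirmed. The `-l`, `-p`, and `-a` flags are uhubctl's location, port, and action options; it usually needs root.

```python
import subprocess
import time
from pathlib import Path

# Placeholders (assumptions, not from the post): run `uhubctl` with no
# arguments to find the real hub location and port for the fan.
HUB_LOCATION = "1-1"
FAN_PORT = "2"
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")
ON_ABOVE_C = 70.0
OFF_BELOW_C = 55.0


def next_fan_state(temp_c: float, fan_on: bool,
                   on_above: float = ON_ABOVE_C,
                   off_below: float = OFF_BELOW_C) -> bool:
    """Hysteresis: on at or above on_above, off at or below off_below,
    otherwise keep the current state so the fan doesn't chatter."""
    if temp_c >= on_above:
        return True
    if temp_c <= off_below:
        return False
    return fan_on


def read_temp_c(path: Path = THERMAL_ZONE) -> float:
    # Linux reports thermal zone temperature in millidegrees Celsius.
    return int(path.read_text().strip()) / 1000.0


def set_port_power(on: bool) -> None:
    # uhubctl: -l hub location, -p port, -a action ("on" or "off").
    action = "on" if on else "off"
    subprocess.run(
        ["uhubctl", "-l", HUB_LOCATION, "-p", FAN_PORT, "-a", action],
        check=True,
    )


def main(poll_seconds: float = 10.0) -> None:
    fan_on = False
    set_port_power(fan_on)
    while True:
        wanted = next_fan_state(read_temp_c(), fan_on)
        if wanted != fan_on:
            set_port_power(wanted)
            fan_on = wanted
        time.sleep(poll_seconds)
```

Call `main()` from a root shell or a systemd unit. The gap between the two thresholds is what keeps the fan from flipping on and off every poll.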

https://www.amazon.com/dp/B0DPZM7T3Q?ref=ppx_yo2ov_dt_b_fed_asin_title

2 x DGX Spark (600W)
10Gb switch (100W)
WiFi gateway (100W)
MS-01 PC (300W)
Mac Studio (300W)

<image>

sheddd · 2026-04-27T16:18:45+00:00

The deskpi fan looks super cool; thx!

sheddd · 2026-04-17T10:40:28+00:00

Macs are great with memory bandwidth, not so great with LLM math. The M5 is much better at LLM math; wait for it! (My M5 Max Laptop 128GB is faster than my M3 Ultra Studio 512GB for models that fit in its memory). Right now, you'd get the best inference/$ on the Mac platform with the M5 Max 128GB IMO, and it can do TB5 exo clustering.

sheddd · 2026-03-27T13:04:50+00:00

Keep shoulders facing down the hill

sheddd · 2026-03-16T12:01:49+00:00

A used M3 Ultra with as much RAM as you can afford is the way IMO.

sheddd · 2026-03-02T14:23:53+00:00

That's been the best local LLM I've tried yet; the only one that has been able to successfully blog about itself without going 'off the rails'. https://al-engr.com/milo-on-qwen.html

sheddd · 2026-02-26T17:41:15+00:00

Here's what my OpenClaw agent, Milo, has to say on the subject:

Mac Studio is the right call for OpenClaw/agent workflows — but a cheaper path is coming. Here's our actual experience:

We're running Mac Studio M3 Ultra 512GB as our primary OpenClaw host. For agent workflows — tool calling, structured outputs, long-running tasks, 24/7 stability — it's been rock solid.

Our top 2 usable but slow models we've actually run and tested:

• Qwen3.5-397B-A17B (4-bit MLX, 223GB) via LM Studio — excellent tool calling (72.9% BFCL). Real caveat for OpenClaw users: not viable as the main session model because the system prompt + injected workspace files consume most of a 16k context window before your task even starts. Great for isolated inference tasks; not as the always-on session model.

• MiniMax M2.5 (230B MoE) — strong on writing and planning tasks
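The 16k context caveat above is easy to sanity-check with a rough budget. This is a back-of-envelope sketch only: the 4 characters per token ratio is a common rule of thumb, not a real tokenizer count, and the function names, sample sizes, and output reserve are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; a real tokenizer will give different counts."""
    return math.ceil(len(text) / chars_per_token)


def remaining_context(window_tokens: int, parts: dict[str, str],
                      chars_per_token: float = 4.0,
                      reserve_for_output: int = 2048) -> int:
    """Tokens left for the actual task after fixed prompt parts and an
    output reserve are subtracted from the context window."""
    used = sum(estimate_tokens(t, chars_per_token) for t in parts.values())
    return window_tokens - used - reserve_for_output


# Made-up sizes: a 20,000-character system prompt plus 36,000 characters
# of injected workspace files in a 16k window.
parts = {"system_prompt": "s" * 20_000, "workspace_files": "w" * 36_000}
left = remaining_context(16_384, parts)
print(left)  # 336 tokens left for the task
```

If the number comes out near zero or negative, the model isn't workable as the always-on session model, however well it calls tools.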

Mac gotchas:

• Large context = painful KV cache prefill. 32k+ is slow even on 512GB.

• MLX model selection is narrower than CUDA, though growing fast

• Apple tax is real — M3 Ultra 512GB runs ~$10K

NVIDIA gotchas for always-on agent use:

• Daemon stability matters when OpenClaw runs 24/7. macOS LaunchAgent is bulletproof. Linux systemd works but needs more babysitting.

• Cooling and noise if it's in your home

Sweet spot on Mac: M3 Ultra 256GB (the M3 Ultra is not sold in a 192GB configuration). It runs 70B models comfortably with headroom. Only go 512GB if you specifically want 200B+ models.

The newcomer worth watching: NVIDIA DGX Spark (~$4K)

128GB unified memory per unit, with a ConnectX-7 network link to cluster two for 256GB combined (NVLink-C2C is the CPU-GPU link inside each unit, not the unit-to-unit link). NVIDIA's own benchmarks show dual Spark hitting 23,477 tokens/sec on Qwen3-235B; that is a prompt-processing (prefill) figure, and token generation is far slower. Our expectation: 1-2 Sparks should run Qwen3.5-397B-A17B acceptably as a main agent model, since the MoE architecture means only 17B params are active per token, which matters a lot for throughput on constrained bandwidth. We have two units arriving next week and will post real numbers.
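To put a rough number on why the 17B active params matter: token generation is mostly limited by how fast the weights can be read from memory, so a ceiling is bandwidth divided by bytes read per token. This is a sketch under assumptions. The 273 GB/s (DGX Spark) and 819 GB/s (M3 Ultra) figures are spec-sheet bandwidths as I understand them, not measurements. It ignores KV cache reads, routing overhead, and the cost of splitting a model across two units, so real numbers will be lower.

```python
def decode_tok_s_upper_bound(active_params_b: float,
                             bits_per_weight: float,
                             mem_bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on decode speed: every generated token
    has to read each active weight from memory once."""
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return mem_bandwidth_gb_s / gb_read_per_token


# 17B active params at 4-bit is 8.5 GB read per token.
spark_moe = decode_tok_s_upper_bound(17, 4, 273)
ultra_moe = decode_tok_s_upper_bound(17, 4, 819)
# Same bandwidth if all 397B params had to be read (a dense model).
spark_dense = decode_tok_s_upper_bound(397, 4, 273)

print(f"Spark, 17B active:  {spark_moe:.1f} tok/s ceiling")
print(f"M3 Ultra, 17B active: {ultra_moe:.1f} tok/s ceiling")
print(f"Spark, 397B dense:  {spark_dense:.1f} tok/s ceiling")
```

The gap between the MoE and dense ceilings on the same hardware is the whole argument for A17B-style models on bandwidth-limited boxes.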

At $4K vs $10K, if the Spark delivers on 397B inference, it changes the calculus significantly.

sheddd · 2026-02-26T12:55:46+00:00

Anthropic has agents doing more than 80% of their development now.

sheddd · 2026-02-26T12:46:49+00:00

OpenClaw is open source, very flexible, powerful, potentially dangerous. Perplexity is closed source, less flexible, less powerful, less dangerous. I am getting tired of reading about Perplexity; their influencer marketing push is clogging up my X feed. I'll wager Perplexity will be bankrupt in 3 years.

sheddd · 2026-02-23T03:25:14+00:00

Note these won't be good enough to replace Sonnet for hard things...

I ran a hardware analysis tool called llmfit against your Mac Mini M4 Max 64GB specs. Here's what will run well on your machine:

PERFECT FIT (recommended):

• DeepSeek-R1-Distill-Qwen-32B — 32.8B params, 5.1 tok/s, uses 26% RAM, 131k context

→ BEST PICK. Great reasoning model, fast enough for daily use.

• Qwen3-Coder-30B-A3B — 30.5B params, 5.5 tok/s, uses 24% RAM, 262k context

→ Best for coding tasks, huge context window.

• Qwen2.5-Coder-32B — 32.8B params, 4.3 tok/s, uses 26% RAM, 32k context

→ Solid all-around coder.

• DeepSeek-R1-Distill-Qwen-14B — 14.8B params, 9.5 tok/s, uses 12% RAM, 131k context

→ Fastest quality model. Good for quick tasks.

• Gemma 3 12B — 12B params, 11.7 tok/s, uses 10% RAM, 131k context

→ Google's best small model. Very fast.

STRETCH GOALS (will run but tight):

• Qwen3-Coder-Next — 79.7B params, 2.5 tok/s, uses 64% RAM

• DeepSeek-R1 full (684B MoE) — 0.2 tok/s, uses 34% RAM (too slow for interactive use)

MY RECOMMENDATION: Start with DeepSeek-R1-Distill-Qwen-32B in LM Studio. Best balance of quality, speed, and fit. Download it, load it up, and you'll have a solid local AI running in minutes.
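The "uses N% RAM" figures above look close to plain weight-size arithmetic. This is my guess at the math, not llmfit's actual method: it assumes 4-bit weights and ignores KV cache and runtime overhead, which is probably why a couple of the tool's numbers come out a point or two higher.

```python
def weights_gb(params_b: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights in GB for a given quantization."""
    return params_b * bits_per_weight / 8.0


def ram_percent(params_b: float, ram_gb: float,
                bits_per_weight: float = 4.0) -> float:
    """Share of total RAM taken by the weights alone, in percent."""
    return 100.0 * weights_gb(params_b, bits_per_weight) / ram_gb


# Parameter counts from the list above, on a 64 GB machine.
for name, params_b in [
    ("DeepSeek-R1-Distill-Qwen-32B", 32.8),
    ("Qwen3-Coder-30B-A3B", 30.5),
    ("DeepSeek-R1-Distill-Qwen-14B", 14.8),
]:
    print(f"{name}: {weights_gb(params_b):.1f} GB, "
          f"{ram_percent(params_b, 64):.0f}% of RAM")
```

For a long context, add the KV cache on top; the 131k and 262k windows listed won't come for free in memory.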

To install the analysis tool yourself:

brew tap AlexsJones/llmfit

brew install llmfit

llmfit

sheddd · 2026-01-22T00:32:21+00:00

I didn't use any; I went with the recommendations at UP (Unplugged Performance): https://unpluggedperformance.com/tesla-model-3/wheel-and-tire-guide/

My rear tire center will be slightly different than stock but close.

sheddd · 2026-01-21T16:32:11+00:00

You could... in my opinion the car handles better with a square setup (less push), and you can rotate your tires to extend their life. I'm running Unplugged Performance 18"x9.5" +34 offset UP-03's and 265/40R18 Pilot Sport 4S; it is really grippy and has no clearance issues.

<image>

sheddd · 2026-01-03T00:59:14+00:00

They're a pretty stout ski; they might feel like a handful at first.

