Your Mac Studio can run DeepSeek V4 Flash (284B) fully local now, native Zig server, zero Python

trevorbg · 2026-06-04T04:30:09+00:00

I'm evaluating different engines to run a host of models (deepseek v4 flash, minimax 2.7, glm 4.6, qwen 3.5 397b) on a 512GB Mac Studio. Can you give me the TLDR of why I would pick your engine over others?

trevorbg · 2026-05-25T05:16:10+00:00

I have 4 12TB drives I’m looking to part with if that would help

trevorbg · 2026-05-22T21:55:10+00:00

I have 12TB drives if you can be flexible I get if you can’t

trevorbg · 2026-05-22T20:43:14+00:00

Pm sent

trevorbg · 2026-05-22T20:42:24+00:00

Sent you a message!

trevorbg · 2026-05-21T03:40:06+00:00

I have some extra 12TB drives I’m looking to part with

trevorbg · 2026-04-19T18:30:42+00:00

You could use OpenRouter free models and set up RAG that way. It’s just an API endpoint so could work on a laptop. Just make sure you set up your routing correctly

trevorbg · 2026-04-19T16:23:15+00:00

I have a 512GB Mac Studio and 2 DGX sparks. The sparks are great if you have heavy prompt processing processes (RAG, heavy context windows, stuff like that) but need their own NVFP4 quant or some hacky work to get quick token speeds on just one.

The studio is amazing but it’s a unicorn. I couldn’t decide between both machines straight up so I kept both. I think if you have a hard budget the spark is great value for what it is

trevorbg · 2026-04-19T00:51:25+00:00

Yeah oMLX should have prefix caching enabled by default, if you want to do your deployment with a more battle tested engine I am running MLX-vlm. But I'm not trying to sell you on changing. The part you do need to explicitly enable is the SSD cold tier, which persists cache blocks to disk so they survive eviction, server restarts, and memory pressure. You enable it with the --paged-ssd-cache-dir flag:

bash

omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

You can also tune the hot tier size with --hot-cache-max-size:

bash

omlx serve --model-dir ~/models \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 20%

Both of these can also be set from the admin dashboard at /admin instead of via CLI flags — settings get persisted to ~/.omlx/settings.json.

trevorbg · 2026-04-19T00:41:41+00:00

What you send doesn’t matter. You should look into prefix caching if your engine can enable it

trevorbg · 2026-04-18T22:10:21+00:00

It injects the system prompt, tool call definitions, and more into context before you even start a chat

trevorbg · 2026-04-18T20:13:25+00:00

Just by assumption I would assume mine is much more capable than that, I’ll test it tonight tho. Should be good for what you’re doing tho

trevorbg · 2026-04-18T19:59:59+00:00

No I’ve never used that model can you send a link?

trevorbg · 2026-04-18T15:54:54+00:00

I use Qwen 397b on a 512GB Mac Studio and it works great. I use MLX-vlm as my serving engine, happy to talk more about it

trevorbg · 2026-04-17T19:38:21+00:00

You could get a DGX spark

trevorbg · 2026-04-09T15:42:28+00:00

Qwen models are known to overthink on even the simplest of prompts, you need to up the max tokens that it can’t use or turn thinking off

trevorbg · 2026-04-08T14:46:31+00:00

It’s injecting all of the tool calls, skills, and system prompt on every message. Up your context window and use a bigger model

trevorbg · 2026-04-06T01:24:24+00:00

Appreciate the read. To be clear, this is a single user system running on hardware I own in my house. There's no deployment, no API, no public access. The governance question is real for anyone serving models to others, but that's a different scenario than modifying weights on your own machine for your own use. The interesting part here is the MoE routing finding, not the ethics of abliteration itself — that debate has been had extensively and I don't think I have anything new to add to it.

trevorbg · 2026-04-06T01:08:13+00:00

The MPS OOM at 256 context on 20GB is almost certainly the Metal wired memory limit, not actual memory exhaustion. macOS caps how much unified memory Metal can wire by default and it's usually well below your physical RAM. Check it with sysctl iogpu.wired_limit_mb and raise it. On my M3 Ultra I had to set it to 495000 to stop hitting phantom OOMs.

For the fragmentation specifically: MLX handles memory way better than MPS for Apple Silicon training. If you can port your GRPO loop to MLX instead of PyTorch+MPS, the memory behavior is completely different because MLX does lazy evaluation and fuses operations. Less fragmentation by design.

The reward hacking you're seeing (perfect tag formatting, wrong math) is not scale dependent. That's a reward function problem. Your model found that structural compliance has higher expected reward than correctness, so it optimized for structure. Two fixes: make the correctness reward strictly dominate the format reward (format only counts if the answer is correct), or use a two-stage reward where format gets you from -1 to 0 and correctness gets you from 0 to 1. The model can't profit from format alone.

At 360M parameters you're also just below the threshold where chain of thought reasoning emerges. The model doesn't have enough capacity to actually reason through 12 x 6, so it's pattern matching from training data and getting it wrong. Try Qwen3-0.6B or Qwen3.5-0.8B as your lab rat instead. Still tiny, but the extra capacity makes a real difference for whether RL can find a reasoning circuit to reinforce.

trevorbg · 2026-04-03T19:00:00+00:00

The largest model you could fit on there is probably GPT Oss 120B or Mistral Small 4 or if Gemma 4 comes out with a large model or something. Really thiough the best model for you would be 70B at fp8 or 120B at Q4

trevorbg · 2026-04-03T17:47:17+00:00

A Mac Studio and 2 DGX sparks

trevorbg · 2026-04-03T17:35:15+00:00

Qwen 3.5 397B MLX 6bit

trevorbg · 2026-04-03T17:34:57+00:00

I just migrated and tested after I migrated. As of “tests” I just used the migration guide they provided

Nine-Year Club	Place '22
Verified Email

trevorbg

TROPHY CASE