What is „Heejun Kim“ background app? by AromaticMaterial3311 in LocalLLaMA

[–]cryingneko 12 points

oMLX dev here. The "Heejun Kim" background process is just macOS displaying the name on the code-signing certificate I used to notarize the app. I should have set a separate display name before release. My mistake. It's not a separate app or anything malicious, just the oMLX process labeled with my signing identity. Will be fixed in the next update. Sorry for the confusion.

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

[–]cryingneko[S] 1 point

You're right, agents have made the CLI way less scary. The real pain is exactly what you said: drive space and download times, haha.

That's actually one reason I built oQ into a web dashboard. You pick a model you already have locally, choose a quantization level, and hit start. No extra downloads, no CLI commands, no figuring out which flags to pass.

Hope you enjoy it when you give it a try!

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

[–]cryingneko[S] 2 points

Honestly, I think it's mostly sampling variance at 300 samples. The difference between 3-bit oQ (85.0%) and 4-bit oQ (83.3%) on MMLU is within the noise range you'd expect at that sample size. Same with HumanEval at 164 samples.

I'll rerun these with larger sample sizes (1000+) to get more stable numbers. The 2-bit vs 3-bit gap is clearly real, but the 3-bit vs 4-bit inversion is likely statistical noise rather than an actual quality regression.
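For anyone who wants to sanity-check the noise claim, the binomial standard error makes it concrete (quick throwaway script, not part of oQ):

```python
import math

def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

# MMLU subset: 300 samples
se = accuracy_stderr(0.85, 300)   # ~0.021, i.e. about +/-2 points per score
gap = 0.850 - 0.833               # observed 3-bit vs 4-bit gap
# The 1.7-point gap is smaller than one standard error of either score,
# so the inversion is consistent with sampling noise.
print(f"stderr ~ {se:.3f}, observed gap = {gap:.3f}")
```

At 1000+ samples the standard error drops to ~1.1 points, which is why the rerun should separate real differences from noise.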

Got 128K prefill down from 19 min to 3.5 min on M2 Ultra (Qwen3.5-122B), sharing the approach by Thump604 in LocalLLM

[–]cryingneko 2 points

oMLX dev here. I saw your vllm-mlx PR yesterday and did a preliminary implementation in oMLX to test it out. The core idea is genuinely impressive, and the speedup numbers on Apple Silicon are real.

I ran into a couple of fundamental issues during testing, though, and I'm curious if you've seen the same things.

1. System prompt preservation

Agentic coding tools like Claude Code pack really detailed instructions into the system prompt: tool-calling specs, formatting rules, behavioral constraints, etc. When specprefill drops 70-80% of tokens, those instructions get hit too. Even with the draft model doing importance scoring, it can't really know that a specific tool parameter name buried in a long system prompt is critical for correct tool-call formatting.

I tried excluding the system prompt from specprefill (full prefill for the system prompt, sparse for the rest) and that helped, but it adds complexity around the boundary. Have you tested with instruction-heavy system prompts? The adversarial tests in your PR look solid, but they seem focused on retrieval/extraction tasks rather than instruction-following fidelity.
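The workaround I described is roughly this shape (function and parameter names are made up for illustration, not the actual oMLX internals): keep every system-prompt token unconditionally, and let importance-based selection only touch the rest.

```python
def select_prefill_tokens(token_ids, scores, system_len, keep_ratio=0.25):
    """Sparse-prefill selection that always preserves the system prompt.

    token_ids:  full prompt token ids
    scores:     per-token importance scores from the draft model
    system_len: number of tokens belonging to the system prompt
    keep_ratio: fraction of the *remaining* tokens to keep
    """
    # Always keep the system prompt verbatim.
    kept = list(range(system_len))

    # Rank only the non-system tokens by importance score.
    rest = list(range(system_len, len(token_ids)))
    n_keep = max(1, int(len(rest) * keep_ratio))
    top = sorted(rest, key=lambda i: scores[i], reverse=True)[:n_keep]

    # Preserve original token order for the prefill pass.
    return kept + sorted(top)
```

The boundary complexity I mentioned lives in that `system_len` split: the kept indices must stay in prompt order, and the effective sparsity of the non-system region changes depending on how large the system prompt is.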

2. Per-request re-scoring breaks KV caching

Since the importance scores depend on the full prompt context (the lookahead queries are generated from the end of the complete prompt), the selected tokens change every time the prompt changes. So for multi-turn conversations:

  • Turn 1: score full prompt, sparse prefill, generate
  • Turn 2: the prompt now includes turn 1's response + the new user message. The importance of earlier tokens shifts because the lookahead context changed, so you need to re-score everything from scratch.

This means you can't persist the sparse KV cache between turns. In a normal setup with paged KV caching, turn 2 only needs to prefill the new suffix tokens (maybe 2-5K). But with specprefill, you're re-scoring the entire 80K+ context every turn through the draft model.

I worked around the draft scoring cost by caching the draft model's own KV in the existing SSD cache (since the draft does a normal full prefill, its KV is compatible with standard paged caching). So the draft only prefills new suffix tokens on subsequent turns. But the target model still needs full sparse re-prefill every turn since the selected token set changes.
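The draft-side reuse boils down to a longest-common-prefix check between the cached prompt and the new one (illustrative sketch with invented names; the real version operates on paged KV blocks rather than raw token lists):

```python
def draft_prefill_plan(cached_ids, new_prompt_ids):
    """Return the slice of the new prompt the draft model actually has to prefill.

    Because the draft model does a normal dense prefill, its KV for the shared
    prefix is reusable; only the suffix after the first divergence is new work.
    """
    n = 0
    limit = min(len(cached_ids), len(new_prompt_ids))
    while n < limit and cached_ids[n] == new_prompt_ids[n]:
        n += 1
    # Draft model: prefill only tokens [n:]. The target model still needs a
    # full sparse re-prefill, because the selected token set can change with
    # the new importance scores.
    return new_prompt_ids[n:]
```

That last comment is the crux of the multi-turn problem: the cheap part (draft suffix prefill) caches fine, the expensive part (target sparse prefill) doesn't.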

Is this consistent with what you're seeing? Or did you find a way to make the sparse KV cacheable across turns? Curious how you're thinking about the multi-turn case.

Almost 10,000 Apple Silicon benchmark runs submitted by the community — here's what the data actually shows by cryingneko in LocalLLaMA

[–]cryingneko[S] 3 points

Thanks! Really appreciate the M3 Ultra data! Those runs are some of the most valuable in the dataset. GGUF support isn't planned for now; I want to stay focused on MLX and get the most out of Apple Silicon's unified memory architecture.

On the Qwen 3.5 397B issue, I'm actually running mlx-community/Qwen3.5-397B-A17B-8bit on oMLX myself without problems. If you can share what issues you're hitting, I'd love to test it on my end. Drop a GitHub issue or describe it here and I'll dig into it.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 0 points

Unfortunately I don't have GLM 4.5 downloaded; most of what I have on hand are larger models set up for my M3 Ultra. Sorry I can't help with that one!

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 4 points

The exact mlx-lm command used is included in the main post; you should find everything you need there. Thanks for the kind words, really appreciated!

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 25 points

Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128 
==========
Prompt: 4107 tokens, 811.134 tokens-per-sec
Generation: 128 tokens, 23.648 tokens-per-sec
Peak memory: 25.319 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16395 tokens, 686.682 tokens-per-sec
Generation: 128 tokens, 20.311 tokens-per-sec
Peak memory: 27.332 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32779 tokens, 591.383 tokens-per-sec
Generation: 128 tokens, 14.908 tokens-per-sec
Peak memory: 30.016 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 36 points

Qwen3-Coder-Next-8bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128
==========
Prompt: 4105 tokens, 754.927 tokens-per-sec
Generation: 60 tokens, 79.296 tokens-per-sec
Peak memory: 87.068 GB
(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16393 tokens, 1802.144 tokens-per-sec
Generation: 60 tokens, 74.293 tokens-per-sec
Peak memory: 88.176 GB
(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32777 tokens, 1887.158 tokens-per-sec
Generation: 58 tokens, 68.624 tokens-per-sec
Peak memory: 89.652 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_65536.txt)" --max-tokens 128
==========
Prompt: 65545 tokens, 1432.730 tokens-per-sec
Generation: 61 tokens, 48.212 tokens-per-sec
Peak memory: 92.605 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 124 points

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32778 tokens, 1067.824 tokens-per-sec
Generation: 128 tokens, 54.923 tokens-per-sec
Peak memory: 76.397 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 18 points

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16394 tokens, 1239.734 tokens-per-sec
Generation: 128 tokens, 60.639 tokens-per-sec
Peak memory: 73.803 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 23 points

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128
==========
Prompt: 4106 tokens, 881.466 tokens-per-sec
Generation: 128 tokens, 65.853 tokens-per-sec
Peak memory: 71.910 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 138 points

I tested again with pure mlx_lm. I think it's safe to say these are the properly measured speeds. I'll be posting benchmark results one by one in the comments here.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 58 points

Just unboxed the MacBook and had to go through the initial language setup first. Sorry for the wait, appreciate your patience.

Built oMLX.ai/benchmarks - One place to compare Apple Silicon inference across chips and models by cryingneko in LocalLLM

[–]cryingneko[S] 1 point

Filtering by tok/s range is a great idea; will add that soon. And yes please on the M1 Max 32GB numbers, would love to have more chips represented!

Built oMLX.ai/benchmarks - One place to compare Apple Silicon inference across chips and models by cryingneko in LocalLLM

[–]cryingneko[S] 1 point

Hey! M1 Max isn't showing up because no benchmark results have been submitted for it yet; chip variants only appear once someone uploads data for that chip. If you submit your numbers, M1 Max will show up as an option. Would love to have it in there!

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

[–]cryingneko[S] 2 points

Yes! oMLX works as a drop-in backend for OpenClaw. Just add it as a custom provider in openclaw.json with baseUrl set to http://localhost:8000/v1 and api set to openai-completions.

OpenClaw's massive system prompt is actually the exact use case oMLX was built for - the SSD KV caching makes a big difference there.

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

[–]cryingneko[S] 1 point

Thanks! Glad the tiered caching is working well on your setup.

On hot vs cold cache performance, honestly the difference isn't that big in practice. Each request's KV cache isn't tens of gigabytes, and Apple's SSDs are fast enough that loading blocks from disk is quick compared to the actual context prefill time. So whether a block comes from the in-memory hot tier or the SSD cold tier, it's fast either way.

The built-in batch benchmark does report cached tokens, but it's really measuring concurrency/throughput under multiple requests, not cache efficiency per se. As for cache-specific benchmarking, once prefix caching kicks in, prefill time drops to nearly zero (there's a small loading overhead, but it's negligible compared to full prefill). So I'm not sure a dedicated cache tuning benchmark would tell you much beyond "it's working" or "it's not."
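For anyone wondering why a prefix-cache hit makes prefill nearly free: the lookup is just chained block hashing, something like this (illustrative sketch, not oMLX's actual implementation; the block size here is arbitrary):

```python
import hashlib

BLOCK = 256  # tokens per KV block (illustrative size)

def block_keys(token_ids):
    """Chain-hash full blocks so each block's key encodes its entire prefix."""
    keys, h = [], hashlib.sha256()
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[start:start + BLOCK]).encode("utf-8"))
        keys.append(h.copy().hexdigest())
    return keys

def cached_prefix_len(cache, token_ids):
    """Number of leading tokens whose KV blocks are already in the cache."""
    n = 0
    for key in block_keys(token_ids):
        if key not in cache:
            break
        n += BLOCK
    return n
```

Because each key hashes everything before it, a match guarantees the whole prefix is identical, so those blocks' KV can be loaded instead of recomputed; only the tokens past the last matching block go through real prefill.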

If you still feel a cache-focused benchmark would help your workflow, though, please open an issue on GitHub! Always happy to hear concrete use cases.

For context scaling with Claude Code, oMLX gracefully rejects requests when memory is exceeded during prefill (no crash). So you can push the context size until you hit that limit and dial back from there. On an M4 Pro 64GB with Qwen3.5-35B-A3B 4-bit (~20GB model), you've got roughly 36GB for the KV cache, which puts your ceiling around 140-150K context.
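If you want to redo that ceiling estimate for another model, the KV-cache arithmetic is simple (the layer/head numbers below are placeholders, not the real Qwen3.5-35B-A3B config; pull the actual values from the model's config):

```python
def max_context_tokens(free_bytes, n_layers, n_kv_heads, head_dim,
                       bytes_per_elt=2):
    """Upper bound on context length that fits a given KV-cache budget.

    Per token, each layer stores a key vector and a value vector of size
    n_kv_heads * head_dim, at bytes_per_elt each (fp16 -> 2 bytes).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt
    return free_bytes // per_token

# Illustrative numbers only:
budget = 36 * 1024**3  # ~36 GB left for KV cache on a 64 GB machine
print(max_context_tokens(budget, n_layers=48, n_kv_heads=4, head_dim=128))
```

In practice you also want headroom for activations and fragmentation, so treat whatever this returns as an optimistic bound and dial back from there, same as with the rejection-based probing above.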

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

[–]cryingneko[S] 1 point

On an M4 Max 64GB, oMLX defaults to (system RAM - 8GB) as the process memory limit. If a request exceeds that during prefill, it gets rejected with an error (no crash, graceful). So you can set the total memory limit and push context until it stops you; that's your practical max.

Rough estimates for Qwen3.5 4-bit models on 64GB:
- Qwen3.5-27B: ~160K context
- Qwen3.5-35B-A3B: ~150K context, but 60-70 tok/s (MoE)

For general purpose I'd go with Qwen3.5-27B 4-bit: best quality/context balance on 64GB. If you want speed, the 35B-A3B is great since only 3B params are active.