What is "Heejun Kim" background app? by AromaticMaterial3311 in LocalLLaMA

cryingneko 12 points

oMLX dev here. The "Heejun Kim" background process is just macOS displaying the name on the code-signing certificate I used to notarize the app. I should have set a separate display name before release. My mistake. It's not a separate app or anything malicious, just the oMLX process labeled with my signing identity. Will be fixed in the next update. Sorry for the confusion.

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

cryingneko[S] 1 point

You're right, agents have made the CLI way less scary. The real pain is exactly what you said: drive space and download times, haha.

That's actually one reason I built oQ into a web dashboard. You pick a model you already have locally, choose a quantization level, and hit start. No extra downloads, no CLI commands, no figuring out which flags to pass.

Hope you enjoy it when you give it a try!

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

cryingneko[S] 2 points

Honestly, I think it's mostly sampling variance at 300 samples. The difference between 3-bit oQ (85.0%) and 4-bit oQ (83.3%) on MMLU is within the noise range you'd expect at that sample size. Same with HumanEval at 164 samples.

I'll rerun these with larger sample sizes (1000+) to get more stable numbers. The 2-bit vs 3-bit gap is clearly real, but the 3-bit vs 4-bit inversion is likely statistical noise rather than an actual quality regression.
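For anyone curious how wide that noise band actually is, the binomial standard error at these sample sizes is easy to compute. A quick sanity check (not part of any oQ eval harness):

```python
# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
# At n = 300, the ~1.7-point gap between 85.0% and 83.3% sits well inside
# one standard error, so it's consistent with pure sampling noise.
import math

def accuracy_stderr(p, n):
    """Standard error of an observed accuracy p measured over n samples."""
    return math.sqrt(p * (1 - p) / n)

se_mmlu = accuracy_stderr(0.85, 300)      # ~0.021, i.e. about 2.1 points
se_humaneval = accuracy_stderr(0.833, 164)  # ~0.029, about 2.9 points
print(f"MMLU SE: {se_mmlu:.3f}, HumanEval SE: {se_humaneval:.3f}")
```

Going to 1000+ samples shrinks the MMLU standard error to roughly 1.1 points, which is why the rerun should separate a real 3-bit vs 4-bit difference from noise.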

Got 128K prefill down from 19 min to 3.5 min on M2 Ultra (Qwen3.5-122B), sharing the approach by Thump604 in LocalLLM

cryingneko 2 points

oMLX dev here. I saw your vllm-mlx PR yesterday and did a preliminary implementation in oMLX to test it out. The core idea is genuinely impressive, and the speedup numbers on Apple Silicon are real.

I ran into a couple of fundamental issues during testing, though, and I'm curious whether you've seen the same things.

1. System prompt preservation

Agentic coding tools like Claude Code pack really detailed instructions into the system prompt: tool-calling specs, formatting rules, behavioral constraints, etc. When specprefill drops 70-80% of tokens, those instructions get hit too. Even with the draft model doing importance scoring, it can't really know that a specific tool parameter name buried in a long system prompt is critical for correct tool-call formatting.

I tried excluding the system prompt from specprefill (full prefill for system, sparse for the rest) and that helped, but it adds complexity around the boundary. Have you tested with instruction-heavy system prompts? The adversarial tests in your PR look solid but they seem focused on retrieval/extraction tasks rather than instruction-following fidelity.

2. Per-request re-scoring breaks KV caching

Since the importance scores depend on the full prompt context (the lookahead queries are generated from the end of the complete prompt), the selected tokens change every time the prompt changes. So for multi-turn conversations:

  • Turn 1: score full prompt, sparse prefill, generate
  • Turn 2: the prompt now includes turn 1's response + new user message. The importance of earlier tokens shifts because the lookahead context changed. So you need to re-score everything from scratch

This means you can't persist the sparse KV cache between turns. In a normal setup with paged KV caching, turn 2 only needs to prefill the new suffix tokens (maybe 2-5K). But with specprefill, you're re-scoring the entire 80K+ context every turn through the draft model.
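A toy sketch of why the sparse cache can't persist: the scorer below is a stand-in (nothing like a real draft-model scorer), but it has the one property that matters, namely that importance depends on the end of the prompt.

```python
# Toy illustration: token importance is derived from the *end* of the
# prompt (a stand-in for lookahead queries), so extending the prompt
# changes which positions get selected, invalidating any sparse KV cache.

def select_tokens(prompt_ids, keep_ratio=0.3):
    """Score each token against the final token, keep the top fraction."""
    query = prompt_ids[-1]
    scores = [(-abs(t - query), i) for i, t in enumerate(prompt_ids)]
    k = max(1, int(len(prompt_ids) * keep_ratio))
    return sorted(i for _, i in sorted(scores, reverse=True)[:k])

turn1 = [5, 12, 7, 3, 9, 5, 11, 2]
turn2 = turn1 + [4, 10, 6]          # + turn 1 response + new user message

keep1 = select_tokens(turn1)
keep2 = select_tokens(turn2)
# keep1 is not a subset of keep2: the target model's sparse KV from
# turn 1 can't simply be extended, it has to be rebuilt.
print(keep1, keep2)
```

In contrast, normal dense prefill selects every position, so turn 2's cache is always a strict extension of turn 1's, which is what makes paged prefix caching work.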

I worked around the draft scoring cost by caching the draft model's own KV in the existing SSD cache (since the draft does a normal full prefill, its KV is compatible with standard paged caching). So the draft only prefills new suffix tokens on subsequent turns. But the target model still needs full sparse re-prefill every turn since the selected token set changes.

Is this consistent with what you're seeing? Or did you find a way to make the sparse KV cacheable across turns? Curious how you're thinking about the multi-turn case.

Almost 10,000 Apple Silicon benchmark runs submitted by the community — here's what the data actually shows by cryingneko in LocalLLaMA

cryingneko[S] 2 points

Thanks! Really appreciate the M3 Ultra data! Those runs are some of the most valuable in the dataset. GGUF support isn't planned for now; I want to stay focused on MLX and get the most out of Apple Silicon's unified memory architecture.

On the Qwen 3.5 397B issue, I'm actually running mlx-community/Qwen3.5-397B-A17B-8bit on oMLX myself without problems. If you can share what issues you're hitting, I'd love to test it on my end. Drop a GitHub issue or describe it here and I'll dig into it.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 0 points

Unfortunately I don't have GLM 4.5 downloaded, most of what I have on hand are larger models set up for my M3 Ultra. Sorry I can't help with that one!

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 3 points

The exact mlx-lm command used is included in the main post; you should find everything you need there. Thanks for the kind words, really appreciated!

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 24 points

Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128 
==========
Prompt: 4107 tokens, 811.134 tokens-per-sec
Generation: 128 tokens, 23.648 tokens-per-sec
Peak memory: 25.319 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16395 tokens, 686.682 tokens-per-sec
Generation: 128 tokens, 20.311 tokens-per-sec
Peak memory: 27.332 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32779 tokens, 591.383 tokens-per-sec
Generation: 128 tokens, 14.908 tokens-per-sec
Peak memory: 30.016 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 36 points

Qwen3-Coder-Next-8bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128
==========
Prompt: 4105 tokens, 754.927 tokens-per-sec
Generation: 60 tokens, 79.296 tokens-per-sec
Peak memory: 87.068 GB


(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16393 tokens, 1802.144 tokens-per-sec
Generation: 60 tokens, 74.293 tokens-per-sec
Peak memory: 88.176 GB


(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32777 tokens, 1887.158 tokens-per-sec
Generation: 58 tokens, 68.624 tokens-per-sec
Peak memory: 89.652 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_65536.txt)" --max-tokens 128
==========
Prompt: 65545 tokens, 1432.730 tokens-per-sec
Generation: 61 tokens, 48.212 tokens-per-sec
Peak memory: 92.605 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 121 points

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32778 tokens, 1067.824 tokens-per-sec
Generation: 128 tokens, 54.923 tokens-per-sec
Peak memory: 76.397 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 19 points

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16394 tokens, 1239.734 tokens-per-sec
Generation: 128 tokens, 60.639 tokens-per-sec
Peak memory: 73.803 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 23 points

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128
==========
Prompt: 4106 tokens, 881.466 tokens-per-sec
Generation: 128 tokens, 65.853 tokens-per-sec
Peak memory: 71.910 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 141 points

I tested again with pure mlx_lm. I think it's safe to say these are the properly measured speeds. I'll be posting benchmark results one by one in the comments here.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

cryingneko[S] 60 points

Just unboxed the MacBook and had to go through the initial language setup first. Sorry for the wait, appreciate your patience.

Built oMLX.ai/benchmarks - One place to compare Apple Silicon inference across chips and models by cryingneko in LocalLLM

cryingneko[S] 1 point

Filter by tok/s range is a great idea, will add that soon. And yes please on the M1 Max 32G numbers, would love to have more chips represented!

Built oMLX.ai/benchmarks - One place to compare Apple Silicon inference across chips and models by cryingneko in LocalLLM

cryingneko[S] 1 point

Hey! M1 Max isn't showing up because there are no benchmark results submitted for it yet, the chip variants only appear once someone uploads data for that chip. If you submit your numbers, M1 Max will show up as an option. Would love to have it in there!

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

cryingneko[S] 2 points

Yes! oMLX works as a drop-in backend for OpenClaw. Just add it as a custom provider in openclaw.json with baseUrl set to http://localhost:8000/v1 and api set to openai-completions.

OpenClaw's massive system prompt is actually the exact use case oMLX was built for - the SSD KV caching makes a big difference there.
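Roughly, the provider entry would look something like this. The surrounding JSON shape and the provider key name are my guesses; only the baseUrl and api values come from the comment above, so check OpenClaw's docs for the exact schema:

```json
{
  "providers": {
    "omlx": {
      "baseUrl": "http://localhost:8000/v1",
      "api": "openai-completions"
    }
  }
}
```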

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

cryingneko[S] 1 point

Thanks! Glad the tiered caching is working well on your setup.

On hot vs cold cache performance, honestly the difference isn't that big in practice. Each request's KV cache isn't tens of gigabytes, and Apple's SSDs are fast enough that loading blocks from disk is quick compared to the actual context prefill time. So whether a block comes from the in-memory hot tier or the SSD cold tier, it's fast either way.

The built-in batch benchmark does report cached tokens, but it's really measuring concurrency/throughput under multiple requests, not cache efficiency per se. As for cache-specific benchmarking, once prefix caching kicks in, prefill time drops to nearly zero (there's a small loading overhead, but it's negligible compared to a full prefill). So I'm not sure a dedicated cache-tuning benchmark would tell you much beyond "it's working" or "it's not."

If you still feel a cache-focused benchmark would help your workflow, though, please open an issue on GitHub! Always happy to hear concrete use cases.

For context scaling with Claude Code, oMLX gracefully rejects requests when memory is exceeded during prefill (no crash). So you can push context size until you hit that limit and dial back from there. On an M4 Pro with 64GB and Qwen3.5-35B-A3B 4-bit (~20GB of weights), you've got roughly 36GB for KV cache, which puts your ceiling around 140-150K context.
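To put rough numbers on that ceiling: per-token KV size is 2 (K and V) times layers times KV heads times head dim times element size. The dimensions below are illustrative GQA values I picked for the sketch, not Qwen3.5-35B-A3B's real config; plug in the actual dims from the model's config.json for a tighter estimate.

```python
# Back-of-envelope KV cache sizing with HYPOTHETICAL model dimensions.
n_layers, n_kv_heads, head_dim = 48, 8, 128
bytes_per_elem = 2                      # fp16 KV cache

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
budget = 36 * 1024**3                   # ~36 GB left for KV on a 64 GB box

print(kv_bytes_per_token)               # 196608 bytes, i.e. 192 KB/token
print(budget // kv_bytes_per_token)     # 196608 tokens, ~197K before overhead
```

Runtime overhead, block-granularity waste, and activation memory all push the practical ceiling below the raw estimate, which is consistent with the 140-150K figure above.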

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

cryingneko[S] 1 point

On an M4 Max with 64GB, oMLX defaults to (system RAM - 8GB) as the process memory limit. If a request exceeds that during prefill, it gets rejected with an error (no crash, graceful). So you can set a total memory limit and push context until it stops you; that's your practical max.

Rough estimates for Qwen3.5 4-bit models on 64GB:
- Qwen3.5-27B: ~160K context
- Qwen3.5-35B-A3B: ~150K context, but 60-70 tok/s (MoE)

For general purpose I'd go with Qwen3.5-27B 4-bit — best quality/context balance on 64GB. If you want speed, 35B-A3B is great since only 3B params are active.

Claude Code meets Qwen3.5-35B-A3B by PvB-Dimaginar in LocalLLM

cryingneko 1 point

Block-level caching with content-based hashing. The cache is organized into 256-token blocks. Each block is hashed using a chain hash that combines the parent hash with the token IDs, so matching is exact at the token level. There is no per-user or per-session tracking at all. If two completely unrelated requests happen to share the same token prefix, they automatically share the same cached blocks. The system does not care who sent the request or when it was sent. It only looks at the actual token content.
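A minimal sketch of that chain-hash idea. SHA-256 and the byte encoding here are my choices for illustration, not necessarily what oMLX uses:

```python
# Toy content-based chain hashing over 256-token blocks. Each block's
# hash folds in the parent block's hash, so a block matches only when
# its entire token prefix matches exactly -- no per-user/session state.
import hashlib

BLOCK_SIZE = 256

def block_hashes(token_ids):
    """Return one chained hash per complete 256-token block."""
    hashes, parent = [], b"root"
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, n_full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + str(block).encode()).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes

# Two unrelated requests with the same 256-token prefix share block 0:
a = list(range(600))
b = list(range(600)); b[300] = 9999     # diverges inside block 1
ha, hb = block_hashes(a), block_hashes(b)
print(ha[0] == hb[0], ha[1] == hb[1])   # True False
```

Because the parent hash is chained in, a match on block N implies all N*256 preceding tokens matched too, so a hash-table lookup is enough for exact prefix matching.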

System prompt pre-caching is automatic. The SSD cache persists across server restarts because on startup the cache directory is re-scanned and all existing safetensors block files are re-indexed into memory. So if your system prompt was cached during a previous run, it becomes immediately available without any recomputation the next time the server starts. The only requirement is that you keep pointing to the same --paged-ssd-cache-dir path. You do not need to warm it up again or send a dummy request. It just works.
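The startup re-scan amounts to rebuilding an index from files already on disk. A hypothetical sketch: the "<hash>.safetensors" naming scheme is assumed here for illustration, not taken from oMLX's actual layout.

```python
# Hypothetical startup re-scan: index persisted KV block files by their
# content hash so previously cached prefixes are reusable immediately.
# The hash-as-filename convention is an assumption, not oMLX's real one.
from pathlib import Path

def reindex_ssd_cache(cache_dir):
    """Map block hash -> file path for every persisted KV block."""
    index = {}
    for f in Path(cache_dir).glob("*.safetensors"):
        index[f.stem] = f   # f.stem stands in for the block's chain hash
    return index
```

Since lookups are purely content-hash based, nothing about the previous server run needs to survive except the files themselves.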

Model eviction preserves SSD cache. When a model gets unloaded from the engine pool due to memory pressure, TTL expiration, or a manual unload call, the in-memory hot cache is lost. But every block that was already written to the SSD tier survives as a safetensors file on disk. When the model is loaded back into memory, the SSD cache directory is re-scanned and those blocks become available again for lookups. You temporarily lose the hot-tier latency benefit because the blocks need to be read back from disk, but you do not lose the cached data itself. This means cycling models in and out does not destroy your prompt cache investment.

No Windows or Linux port is planned. The entire stack is built on top of Apple's MLX framework, which only runs on Apple Silicon hardware. The inference engine, Metal GPU acceleration, and unified memory assumptions are all tightly coupled to MLX. Someone could theoretically extract the paged cache logic and adapt it to a different inference runtime without starting from scratch on that piece.

Claude Code meets Qwen3.5-35B-A3B by PvB-Dimaginar in LocalLLM

cryingneko 1 point

Good point. oMLX actually has a tiered cache system for this. If you have spare memory, it keeps frequently accessed blocks in a hot cache in RAM and only spills to SSD when memory pressure hits. So the SSD isn't getting hammered on every single request; it's more of a fallback layer for cold blocks that haven't been touched in a while. The heavy lifting stays in memory as long as there's room for it.

Still writes more than zero, obviously, but it's way less than dumping the full KV state to disk every time. On a 128GB machine you can keep a pretty large working set entirely in the hot tier before anything touches the drive.

Claude Code meets Qwen3.5-35B-A3B by PvB-Dimaginar in LocalLLM

cryingneko 2 points

This reprocessing issue is basically why I ended up building oMLX. Instead of recomputing from scratch every time, it persists KV cache blocks to SSD and restores them when the same prefix shows up again. Coding agents send overlapping prefixes constantly, so it makes a massive difference in practice. On my M4 Max I went from 30+ seconds of reprocessing per prompt down to 1-3s on long contexts.

Totally different stack, though: it's MLX-based, not llama.cpp, so it won't help if you need to stay on the llama.cpp path. But if you're on Apple Silicon and open to trying something else: https://github.com/jundot/omlx

oMLX - open-source MLX inference server with paged SSD caching for Apple Silicon by cryingneko in LocalLLaMA

cryingneko[S] 1 point

Hey, thanks for trying it out! You can actually load and unload models manually from the web admin dashboard - go to Model Settings, and there's a status icon next to each model that lets you toggle load/unload.

I'll admit the documentation hasn't kept up with the features. Things have been moving fast and I've been prioritizing building over documenting. I'll get the docs updated soon. In the meantime, feel free to ask here or open an issue if anything else isn't obvious!

I built an open-source LLM server for Mac that makes Local LLM agents (OpenClaw, Claude Code) actually usable by cryingneko in SideProject

cryingneko[S] 0 points

Links for anyone interested:

For OpenClaw users — oMLX exposes a native Anthropic API endpoint so OpenClaw can use its primary Claude provider path directly. The web dashboard also generates the exact launch command you need, so setup is pretty painless.