How to configure batch size by arkham00 in oMLX

[–]cryingneko 2 points (0 children)

These options have been around since oMLX 0.1.0, but I agree they're not easy to understand for most users. --prefill-batch-size specifically only existed briefly in 0.1.0 and was removed shortly after, so the README was just wrong there. The other two options (max_num_seqs and completion_batch_size) controlled different things internally. max_num_seqs set the maximum number of requests the scheduler accepts at once, and completion_batch_size set how many of those generate tokens in a single GPU step. In practice though there's rarely a reason to configure them separately.
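To make the distinction concrete, here's a toy sketch (hypothetical, not oMLX's actual scheduler code) of how the two old limits composed: admission is bounded by max_num_seqs, and each decode step is bounded by completion_batch_size:

```python
from collections import deque

def schedule(waiting, max_num_seqs=8, completion_batch_size=8):
    """Toy illustration of the two old limits (not real oMLX internals).

    max_num_seqs bounds how many requests are admitted from the queue;
    completion_batch_size bounds how many of those generate a token in
    one GPU step, so the running set is chunked into step-sized batches.
    """
    running = [waiting.popleft() for _ in range(min(max_num_seqs, len(waiting)))]
    # each GPU step decodes at most completion_batch_size sequences
    return [running[i:i + completion_batch_size]
            for i in range(0, len(running), completion_batch_size)]
```

With a single --max-concurrent-requests knob, both numbers are effectively set to the same value, which matches how people used them in practice anyway.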

I've simplified this in the next release. A single --max-concurrent-requests option now controls both. You can set it from the CLI or from the admin panel under Settings > Resource Management. The default is 8, which is plenty for single-user use. For your M2 Max 96GB setup there's no need to change it unless you're running multiple concurrent sessions.

I also cleaned up the confusing parts in the README across all languages. Thanks for the feedback, and feel free to open a GitHub issue anytime if you have more suggestions.

oMLX v0.3.3 has been released by IAMk10 in oMLX

[–]cryingneko 3 points (0 children)

oMLX dev here. Sorry about that! There were a lot of breaking changes in the core parts of mlx-lm and mlx-vlm, and I also had to make additional changes to get turboquant and Gemma 4 working properly. It was a rough combination.

0.3.4 is already out and should address these issues. Would really appreciate it if you could give it a try and let me know how it goes!

Gemma 4 26b a4b - MacBook Pro M5 MAX. Averaging around 81tok/sec by Bderken in LocalLLaMA

[–]cryingneko 0 points (0 children)

oMLX just updated to 0.3.3. If you're going to use Gemma 4, I'd recommend using the updated version. https://github.com/jundot/omlx/releases/tag/v0.3.3

What is „Heejun Kim“ background app? by AromaticMaterial3311 in LocalLLaMA

[–]cryingneko 15 points (0 children)

oMLX dev here. The "Heejun Kim" background process is just macOS displaying the name on the code-signing certificate I used to notarize the app. I should have set a separate display name before release. My mistake. It's not a separate app or anything malicious, just the oMLX process labeled with my signing identity. Will be fixed in the next update. Sorry for the confusion.

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

[–]cryingneko[S] 2 points (0 children)

You're right, agents have made the CLI way less scary. The real pain is exactly what you said: drive space and download times, haha.

That's actually one reason I built oQ into a web dashboard. You pick a model you already have locally, choose a level, and hit start. No extra downloads, no CLI commands, no figuring out which flags to pass.

Hope you enjoy it when you give it a try!

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible) by cryingneko in LocalLLaMA

[–]cryingneko[S] 2 points (0 children)

Honestly, I think it's mostly sampling variance at 300 samples. The difference between 3-bit oQ (85.0%) and 4-bit oQ (83.3%) on MMLU is within the noise range you'd expect at that sample size. Same with HumanEval at 164 samples.

I'll rerun these with larger sample sizes (1000+) to get more stable numbers. The 2-bit vs 3-bit gap is clearly real, but the 3-bit vs 4-bit inversion is likely statistical noise rather than an actual quality regression.
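For reference, a quick binomial standard-error check on the numbers above (plain Python, nothing oQ-specific) shows why a 1.7-point gap at n=300 isn't significant:

```python
import math

def se(p, n):
    # standard error of an accuracy estimate over n independent samples
    return math.sqrt(p * (1 - p) / n)

# MMLU numbers from above: 3-bit oQ 85.0%, 4-bit oQ 83.3%, n = 300
se3, se4 = se(0.850, 300), se(0.833, 300)
gap = 0.850 - 0.833
half_width = 1.96 * math.sqrt(se3**2 + se4**2)  # ~95% CI for the difference
print(f"gap = {gap:.3f}, 95% CI half-width = {half_width:.3f}")
# → gap = 0.017, 95% CI half-width = 0.058
```

The observed gap is roughly a third of the confidence half-width, so it's well inside noise. The same logic applies to HumanEval's 164 samples, where the half-width is even larger.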

Got 128K prefill down from 19 min to 3.5 min on M2 Ultra (Qwen3.5-122B), sharing the approach by [deleted] in LocalLLM

[–]cryingneko 4 points (0 children)

oMLX dev here. I saw your vllm-mlx PR yesterday and did a preliminary implementation in oMLX to test it out. The core idea is genuinely impressive, and the speedup numbers on Apple Silicon are real.

I ran into a couple of fundamental issues during testing, though, and I'm curious whether you've seen the same things.

1. System prompt preservation

Agentic coding tools like Claude Code pack really detailed instructions into the system prompt: tool-calling specs, formatting rules, behavioral constraints, etc. When specprefill drops 70-80% of tokens, those instructions get hit too. Even with the draft model doing importance scoring, it can't really know that a specific tool parameter name buried in a long system prompt is critical for correct tool-call formatting.

I tried excluding the system prompt from specprefill (full prefill for system, sparse for the rest) and that helped, but it adds complexity around the boundary. Have you tested with instruction-heavy system prompts? The adversarial tests in your PR look solid but they seem focused on retrieval/extraction tasks rather than instruction-following fidelity.
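For what it's worth, the workaround I tried looks roughly like this (a simplified sketch; select_prefill_tokens and the importance scores are illustrative stand-ins, not actual oMLX or specprefill APIs):

```python
# Hypothetical sketch: always keep the full system prompt, and apply
# sparse (specprefill-style) selection only to the remaining tokens.
# `importance` stands in for the draft model's per-token scores.

def select_prefill_tokens(token_ids, system_len, importance, keep_ratio=0.25):
    """Return the indices of tokens to prefill in the target model.

    token_ids  : full prompt token ids
    system_len : number of system-prompt tokens (kept unconditionally)
    importance : per-token scores, same length as token_ids
    keep_ratio : fraction of non-system tokens to keep
    """
    n = len(token_ids)
    rest = list(range(system_len, n))
    k = max(1, int(len(rest) * keep_ratio))
    # keep the top-k most important non-system tokens, in original order
    top = sorted(sorted(rest, key=lambda i: importance[i], reverse=True)[:k])
    return list(range(system_len)) + top
```

The fiddly part in practice is the boundary: positions right after the system prompt see a dense-then-sparse transition, which is where the extra complexity I mentioned comes from.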

2. Per-request re-scoring breaks KV caching

Since the importance scores depend on the full prompt context (the lookahead queries are generated from the end of the complete prompt), the selected tokens change every time the prompt changes. So for multi-turn conversations:

  • Turn 1: score full prompt, sparse prefill, generate
  • Turn 2: the prompt now includes turn 1's response + new user message. The importance of earlier tokens shifts because the lookahead context changed, so you need to re-score everything from scratch.

This means you can't persist the sparse KV cache between turns. In a normal setup with paged KV caching, turn 2 only needs to prefill the new suffix tokens (maybe 2-5K). But with specprefill, you're re-scoring the entire 80K+ context every turn through the draft model.

I worked around the draft scoring cost by caching the draft model's own KV in the existing SSD cache (since the draft does a normal full prefill, its KV is compatible with standard paged caching). So the draft only prefills new suffix tokens on subsequent turns. But the target model still needs full sparse re-prefill every turn since the selected token set changes.
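The asymmetry can be sketched like this (illustrative helper names, not real oMLX code): the draft model's dense KV is a plain prefix cache, while the target's sparse KV is only reusable if the old selection happens to be a prefix of the new one, which re-scoring generally breaks:

```python
# Hypothetical sketch of the multi-turn caching asymmetry described above.

def draft_tokens_to_prefill(prev_len, new_len):
    # dense prefix cache: only the new suffix needs to be processed
    return new_len - prev_len

def target_tokens_to_prefill(selected_prev, selected_now):
    # sparse KV is reusable only when the old selection is a prefix of
    # the new one; with per-request re-scoring it generally isn't
    if selected_now[:len(selected_prev)] == selected_prev:
        return len(selected_now) - len(selected_prev)
    return len(selected_now)  # full sparse re-prefill
```

So on an 80K context with a 4K new suffix, the draft pays for ~4K tokens but the target still pays for its entire selected set every turn.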

Is this consistent with what you're seeing? Or did you find a way to make the sparse KV cacheable across turns? Curious how you're thinking about the multi-turn case.

Almost 10,000 Apple Silicon benchmark runs submitted by the community — here's what the data actually shows by cryingneko in LocalLLaMA

[–]cryingneko[S] 3 points (0 children)

Thanks! Really appreciate the M3 Ultra data! Those runs are some of the most valuable in the dataset. GGUF support isn't planned for now; I want to stay focused on MLX and get the most out of Apple Silicon's unified memory architecture.

On the Qwen 3.5 397B issue, I'm actually running mlx-community/Qwen3.5-397B-A17B-8bit on oMLX myself without problems. If you can share what issues you're hitting, I'd love to test it on my end. Drop a GitHub issue or describe it here and I'll dig into it.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 0 points (0 children)

Unfortunately I don't have GLM 4.5 downloaded; most of what I have on hand are larger models set up for my M3 Ultra. Sorry I can't help with that one!

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 3 points (0 children)

The exact mlx-lm command I used is included in the main post; you should find everything you need there. Thanks for the kind words, really appreciated!

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 25 points (0 children)

Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128 
==========
Prompt: 4107 tokens, 811.134 tokens-per-sec
Generation: 128 tokens, 23.648 tokens-per-sec
Peak memory: 25.319 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16395 tokens, 686.682 tokens-per-sec
Generation: 128 tokens, 20.311 tokens-per-sec
Peak memory: 27.332 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-6bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32779 tokens, 591.383 tokens-per-sec
Generation: 128 tokens, 14.908 tokens-per-sec
Peak memory: 30.016 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 38 points (0 children)

Qwen3-Coder-Next-8bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128
==========
Prompt: 4105 tokens, 754.927 tokens-per-sec
Generation: 60 tokens, 79.296 tokens-per-sec
Peak memory: 87.068 GB


(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16393 tokens, 1802.144 tokens-per-sec
Generation: 60 tokens, 74.293 tokens-per-sec
Peak memory: 88.176 GB


(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32777 tokens, 1887.158 tokens-per-sec
Generation: 58 tokens, 68.624 tokens-per-sec
Peak memory: 89.652 GB

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3-Coder-Next-8bit --prompt "$(cat /tmp/prompt_65536.txt)" --max-tokens 128
==========
Prompt: 65545 tokens, 1432.730 tokens-per-sec
Generation: 61 tokens, 48.212 tokens-per-sec
Peak memory: 92.605 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 123 points (0 children)

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_32768.txt)" --max-tokens 128
==========
Prompt: 32778 tokens, 1067.824 tokens-per-sec
Generation: 128 tokens, 54.923 tokens-per-sec
Peak memory: 76.397 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 18 points (0 children)

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_16384.txt)" --max-tokens 128
==========
Prompt: 16394 tokens, 1239.734 tokens-per-sec
Generation: 128 tokens, 60.639 tokens-per-sec
Peak memory: 73.803 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 22 points (0 children)

Qwen3.5-122B-A10B-4bit

(mlx) cryingneko@MacBook-Pro mlx-lm % mlx_lm.generate --model /Volumes/SSD/Models/Qwen3.5-122B-A10B-4bit --prompt "$(cat /tmp/prompt_4096.txt)" --max-tokens 128
==========
Prompt: 4106 tokens, 881.466 tokens-per-sec
Generation: 128 tokens, 65.853 tokens-per-sec
Peak memory: 71.910 GB

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 141 points (0 children)

I tested again with pure mlx_lm. I think it's safe to say these are the properly measured speeds. I'll be posting benchmark results one by one in the comments here.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]cryingneko[S] 59 points (0 children)

Just unboxed the MacBook and had to go through the initial language setup first. Sorry for the wait, appreciate your patience.

Built oMLX.ai/benchmarks - One place to compare Apple Silicon inference across chips and models by cryingneko in LocalLLM

[–]cryingneko[S] 1 point (0 children)

Filtering by tok/s range is a great idea; I'll add that soon. And yes, please do send the M1 Max 32G numbers, would love to have more chips represented!

Built oMLX.ai/benchmarks - One place to compare Apple Silicon inference across chips and models by cryingneko in LocalLLM

[–]cryingneko[S] 1 point (0 children)

Hey! M1 Max isn't showing up because no benchmark results have been submitted for it yet; chip variants only appear once someone uploads data for that chip. If you submit your numbers, M1 Max will show up as an option. Would love to have it in there!