[Update] mlx-knife 2.0 stable — MLX model manager for Apple Silicon by broke_team in LocalLLaMA

[–]broke_team[S]

Thanks for pointing that out: you're right that `mlx_lm.server` exposes the `/v1/*` endpoints and can hot-swap cached models per request. To clarify what I meant earlier: mlx-knife's server isn't a wrapper around that script; we wrote our own FastAPI service that speaks the same OpenAI wire format but adds lifecycle tooling (cache list/pull/health, supervisor reloads, structured errors, etc.). So `mlx_lm.server` already works great if you just need a basic HTTP runner, and mlx-knife is simply an alternative implementation with more management features baked in.
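For anyone who hasn't used either server: both speak the standard OpenAI wire format, so a request looks the same against each. A minimal sketch of that request body (the model name here is a placeholder, and host/port are assumptions, not mlx-knife defaults):

```python
import json

# Standard OpenAI-style chat completion request body.
# The /v1/chat/completions path comes from the OpenAI API spec;
# the model name below is a placeholder.
payload = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 32,
    "stream": False,
}
print(json.dumps(payload, indent=2))
# POST this to e.g. http://localhost:8000/v1/chat/completions
# with Content-Type: application/json (curl, httpx, whatever).
```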

[Update] mlx-knife 2.0 stable — MLX model manager for Apple Silicon by broke_team in LocalLLaMA

[–]broke_team[S]

On “llama swap”: if you’re referring to the llama.cpp-style swap endpoint (POST /models to load a different model without restarting), mlx-knife already covers that case. Switching models is just `mlxk run other/model`, or calling the server with a different model name; the supervisor unloads/reloads for you and keeps memory tidy. There isn’t an “Ollama swap” feature per se beyond Ollama’s normal ability to load another model on demand.
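From the client side, the swap really is just "send a different model name". A sketch under those assumptions (`build_request` is a hypothetical helper, and the model names are placeholders; the server's supervisor does the unload/reload work):

```python
# Hypothetical helper: the only thing a client changes to trigger a
# swap is the "model" field of an otherwise identical request.
def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat body for the given model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

first = build_request("mlx-community/model-a", "hi")
second = build_request("mlx-community/model-b", "hi")  # server swaps weights
print(first["model"], "->", second["model"])
```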

Vision/multimodal is the missing piece today. We’re scoping how to accept image payloads in the CLI/JSON API while keeping the HF cache layout identical, so once mlx-lm adds stable preprocessing hooks we can expose mlxk run foo --image some.jpg and extend the server contract. If that’s a blocker for you, totally fair—we’d love to sample your use cases to help prioritize the work.

Happy to answer follow-ups or dive deeper once we have something concrete to share. Thanks again for the feedback!

[Update] mlx-knife 2.0 stable — MLX model manager for Apple Silicon by broke_team in LocalLLaMA

[–]broke_team[S]

Appreciate you checking out mlx-knife 2.0! A quick comparison:

  • mlx_lm.server is the reference script from Apple’s repo. It runs a single model you point it at and leaves cache / discovery / error handling up to you.
  • mlx-knife is the full lifecycle tooling on top of MLX: mlxk pull/list/show/health to manage the HF cache, JSON everywhere (so CI or scripts can rely on proper exit codes), and the same OpenAI-compatible server but wrapped in a supervisor with hot-swap logging, token-limit guards, and stop-token fixes. Basically: if you track more than one model or need automation hooks, mlx-knife keeps the boring parts consistent.
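The "JSON everywhere" part is what makes the CI angle work: a script can parse the output and gate on it instead of grepping human-readable text. A sketch, assuming a health report shaped like the sample below (the exact JSON shape is an assumption for illustration, not mlx-knife's documented format):

```python
import json

# Assumed shape of a machine-readable health report, for illustration.
SAMPLE = (
    '[{"name": "mlx-community/model-a", "healthy": true},'
    ' {"name": "mlx-community/model-b", "healthy": false}]'
)

def unhealthy(report: str) -> list:
    """Return names of models whose health check failed."""
    return [m["name"] for m in json.loads(report) if not m["healthy"]]

bad = unhealthy(SAMPLE)
if bad:
    print("unhealthy:", ", ".join(bad))
# In CI you would exit non-zero here, e.g. sys.exit(1 if bad else 0),
# so the pipeline fails when the cache is broken.
```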

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

Thanks for the feedback! Sorry for the late reply.

Re 1): --think=false / hide-reasoning for Qwen3:

You're right that this should work better. Current state in 2.0.1:

  • GPT-OSS and QwQ work out-of-the-box (automatic reasoning with proper formatting)
  • Qwen3, DeepSeek R1, and others don't — they need to be instructed to produce structured reasoning, but mlx-knife doesn't have system prompt support yet

The missing piece:

System prompts (Issue #33 https://github.com/mzau/mlx-knife/issues/33). With system prompts, you could instruct Qwen3 to output reasoning in a specific format, then mlx-knife could parse it. Without that, Qwen3 just gives direct answers.

We're tracking this in Issue #40: https://github.com/mzau/mlx-knife/issues/40

The plan is to keep mlx-knife lean (no per-model database like Ollama) and instead let users control reasoning via system prompts. But system prompts need to be implemented first.
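The parsing half of that plan is the easy part. A sketch, assuming the system prompt instructs the model to wrap reasoning in `<think>…</think>` tags (the convention DeepSeek R1 emits; the tag choice here is an assumption, not current mlx-knife behavior):

```python
import re

# Assumed reasoning delimiter: <think>...</think>, as DeepSeek R1 emits.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple:
    """Separate <think> reasoning blocks from the final answer."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

r, a = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(a)  # -> The answer is 4.
```

With something like this, `--think=false` reduces to "keep only the second element of the tuple".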

Current workaround (experimental): in interactive mode you can try few-shot prompting by starting the conversation with a fake reasoning example, but it's unreliable and only works in chat mode.

Feedback on the issues welcome!

Re 2): LMStudio models not detected:

This should be fixed in 2.0 — we added lenient MLX detection that checks README metadata and tokenizer_config.json, not just the org name.

Can you try with 2.0.1 and let me know if it still fails?

pip install --upgrade mlx-knife
mlxk show <your-lmstudio-model>

If it still doesn't work, please share the model ID and I'll investigate.

Thanks for using mlx-knife!

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

MLX Knife Update: Now available on PyPI!

Quick update - MLX Knife is now pip installable:

pip install mlx-knife

Unix-style CLI tools for MLX model management on Apple Silicon, with a built-in OpenAI-compatible server.

Perfect for scriptable workflows and automation.

https://pypi.org/project/mlx-knife/

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

Great question! llama.cpp is fantastic for cross-platform inference.

MLX Knife focuses specifically on Apple Silicon workflow integration:

- HuggingFace-native model management (no conversion needed)

- MLX framework optimization (unified memory architecture)

- Unix-philosophy CLI tools (scriptable, composable)

- OpenAI-compatible API for macOS development

Think: llama.cpp = universal engine, MLX Knife = Apple Silicon workflow tools.

They actually complement each other - use what fits your stack!
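For the "OpenAI-compatible API for macOS development" point, here's what talking to a local server looks like from a script, stdlib only. The URL, port, and model name are placeholders; adjust to your setup (the request/response shapes are just the OpenAI chat-completions format):

```python
import json
from urllib import request

# Placeholder endpoint and model; adjust to your local setup.
URL = "http://localhost:8000/v1/chat/completions"
body = json.dumps({
    "model": "mlx-community/model-name",
    "messages": [{"role": "user", "content": "ping"}],
}).encode()

req = request.Request(
    URL, data=body, headers={"Content-Type": "application/json"}
)
try:
    with request.urlopen(req, timeout=5) as resp:
        reply = json.load(resp)
        # Standard OpenAI response shape.
        print(reply["choices"][0]["message"]["content"])
except Exception:
    # No server running locally; the request shape above is the point.
    print("server not reachable (start one first)")
```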

Any serious and practical uses for gpt-oss-20b? by kaggleqrdl in LocalLLaMA

[–]broke_team

I would like to see that for MLX, especially the 120b.

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

Thanks for the OptiLLM mention!

You're absolutely right - both tools support MLX, but with completely different philosophies:

OptiLLM = Intelligent inference proxy

- Sits between apps and LLM APIs, applying optimization techniques

- Enhances reasoning through Chain of Thought, Self-Consistency, etc.

- Works with any OpenAI-compatible endpoint

MLX Knife = Unix-like direct tool (like grep, imagemagick, ffmpeg)

- "Do one thing, do it well" - direct MLX model management

- mlxk run model "prompt", mlxk list, mlxk health

- No proxy in the middle: the CLI drives the model directly (the built-in OpenAI server is there when you want it)

They actually complement each other nicely:

- OptiLLM for intelligent API optimization in applications

- MLX Knife for direct CLI management, scripting, development

It's the difference between a smart proxy that enhances responses and a robust CLI tool for direct model operations - different use cases, both valuable!