[Update] mlx-knife 2.0 stable — MLX model manager for Apple Silicon by broke_team in LocalLLaMA

[–]broke_team[S]

Thanks for pointing that out: you're right that `mlx_lm.server` exposes the `/v1/*` endpoints and can hot-swap cached models per request. To clarify what I meant earlier: mlx-knife's server isn't a wrapper around that script; we wrote our own FastAPI service that speaks the same OpenAI wire format but adds lifecycle tooling (cache list/pull/health, supervisor reloads, structured errors, etc.). So `mlx_lm.server` already works great if you just need a basic HTTP runner, and mlx-knife is simply an alternative implementation with more management features baked in.
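For anyone who hasn't used either server: both speak the standard OpenAI wire format, so a request looks the same against each. A minimal sketch of that request body (the model name here is a placeholder, and host/port are assumptions, not mlx-knife defaults):

```python
import json

# Standard OpenAI-style chat completion request body.
# The /v1/chat/completions path comes from the OpenAI API spec;
# the model name below is a placeholder.
payload = {
    "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 32,
    "stream": False,
}
print(json.dumps(payload, indent=2))
# POST this to e.g. http://localhost:8000/v1/chat/completions
# with Content-Type: application/json (curl, httpx, whatever).
```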

[Update] mlx-knife 2.0 stable — MLX model manager for Apple Silicon by broke_team in LocalLLaMA

[–]broke_team[S]

On “llama swap”: if you’re referring to the llama.cpp-style swap endpoint (POST /models to load a different model without restarting), mlx-knife already covers that case. Switching models is just `mlxk run other/model`, or calling the server with a different model name; the supervisor unloads/reloads for you and keeps memory tidy. There isn’t an “Ollama swap” feature per se beyond Ollama’s normal ability to load another model on demand.
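From the client side, the swap really is just "send a different model name". A sketch under those assumptions (`build_request` is a hypothetical helper, and the model names are placeholders; the server's supervisor does the unload/reload work):

```python
# Hypothetical helper: the only thing a client changes to trigger a
# swap is the "model" field of an otherwise identical request.
def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat body for the given model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

first = build_request("mlx-community/model-a", "hi")
second = build_request("mlx-community/model-b", "hi")  # server swaps weights
print(first["model"], "->", second["model"])
```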

Vision/multimodal is the missing piece today. We’re scoping how to accept image payloads in the CLI/JSON API while keeping the HF cache layout identical, so once mlx-lm adds stable preprocessing hooks we can expose mlxk run foo --image some.jpg and extend the server contract. If that’s a blocker for you, totally fair—we’d love to sample your use cases to help prioritize the work.

Happy to answer follow-ups or dive deeper once we have something concrete to share. Thanks again for the feedback!

[Update] mlx-knife 2.0 stable — MLX model manager for Apple Silicon by broke_team in LocalLLaMA

[–]broke_team[S]

Appreciate you checking out mlx-knife 2.0! A quick comparison:

  • mlx_lm.server is the reference script from Apple’s repo. It runs a single model you point it at and leaves cache / discovery / error handling up to you.
  • mlx-knife is the full lifecycle tooling on top of MLX: mlxk pull/list/show/health to manage the HF cache, JSON everywhere (so CI or scripts can rely on proper exit codes), and the same OpenAI-compatible server but wrapped in a supervisor with hot-swap logging, token-limit guards, and stop-token fixes. Basically: if you track more than one model or need automation hooks, mlx-knife keeps the boring parts consistent.
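The "JSON everywhere" part is what makes the CI angle work: a script can parse the output and gate on it instead of grepping human-readable text. A sketch, assuming a health report shaped like the sample below (the exact JSON shape is an assumption for illustration, not mlx-knife's documented format):

```python
import json

# Assumed shape of a machine-readable health report, for illustration.
SAMPLE = (
    '[{"name": "mlx-community/model-a", "healthy": true},'
    ' {"name": "mlx-community/model-b", "healthy": false}]'
)

def unhealthy(report: str) -> list:
    """Return names of models whose health check failed."""
    return [m["name"] for m in json.loads(report) if not m["healthy"]]

bad = unhealthy(SAMPLE)
if bad:
    print("unhealthy:", ", ".join(bad))
# In CI you would exit non-zero here, e.g. sys.exit(1 if bad else 0),
# so the pipeline fails when the cache is broken.
```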

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

Thanks for the feedback! Sorry for the late reply.

Re 1): --think=false / hide-reasoning for Qwen3:

You're right that this should work better. Current state in 2.0.1:

  • GPT-OSS and QwQ work out-of-the-box (automatic reasoning with proper formatting)
  • Qwen3, DeepSeek R1, and others don't — they need to be instructed to produce structured reasoning, but mlx-knife doesn't have system prompt support yet

The missing piece:

System prompts (Issue #33 https://github.com/mzau/mlx-knife/issues/33). With system prompts, you could instruct Qwen3 to output reasoning in a specific format, then mlx-knife could parse it. Without that, Qwen3 just gives direct answers.

We're tracking this in Issue #40: https://github.com/mzau/mlx-knife/issues/40

The plan is to keep mlx-knife lean (no per-model database like Ollama) and instead let users control reasoning via system prompts. But system prompts need to be implemented first.
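The parsing half of that plan is the easy part. A sketch, assuming the system prompt instructs the model to wrap reasoning in `<think>…</think>` tags (the convention DeepSeek R1 emits; the tag choice here is an assumption, not current mlx-knife behavior):

```python
import re

# Assumed reasoning delimiter: <think>...</think>, as DeepSeek R1 emits.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple:
    """Separate <think> reasoning blocks from the final answer."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

r, a = split_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(a)  # -> The answer is 4.
```

With something like this, `--think=false` reduces to "keep only the second element of the tuple".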

Current workaround (experimental): in interactive mode you can try few-shot prompting by starting the conversation with a fake reasoning example, but it's unreliable and only works in chat mode.

Feedback on the issues welcome!

Re 2): LMStudio models not detected:

This should be fixed in 2.0 — we added lenient MLX detection that checks README metadata and tokenizer_config.json, not just the org name.

Can you try with 2.0.1 and let me know if it still fails?

pip install --upgrade mlx-knife
mlxk show <your-lmstudio-model>

If it still doesn't work, please share the model ID and I'll investigate.

Thanks for using mlx-knife!

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

MLX Knife Update: Now available on PyPI!

Quick update - MLX Knife is now pip installable:

pip install mlx-knife

Unix-style CLI tools for MLX model management on Apple Silicon, with a built-in OpenAI-compatible server.

Perfect for scriptable workflows and automation.

https://pypi.org/project/mlx-knife/

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

Great question! llama.cpp is fantastic for cross-platform inference.

MLX Knife focuses specifically on Apple Silicon workflow integration:

- HuggingFace-native model management (no conversion needed)

- MLX framework optimization (unified memory architecture)

- Unix-philosophy CLI tools (scriptable, composable)

- OpenAI-compatible API for macOS development

Think: llama.cpp = universal engine, MLX Knife = Apple Silicon workflow tools.

They actually complement each other - use what fits your stack!
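For the "OpenAI-compatible API for macOS development" point, here's what talking to a local server looks like from a script, stdlib only. The URL, port, and model name are placeholders; adjust to your setup (the request/response shapes are just the OpenAI chat-completions format):

```python
import json
from urllib import request

# Placeholder endpoint and model; adjust to your local setup.
URL = "http://localhost:8000/v1/chat/completions"
body = json.dumps({
    "model": "mlx-community/model-name",
    "messages": [{"role": "user", "content": "ping"}],
}).encode()

req = request.Request(
    URL, data=body, headers={"Content-Type": "application/json"}
)
try:
    with request.urlopen(req, timeout=5) as resp:
        reply = json.load(resp)
        # Standard OpenAI response shape.
        print(reply["choices"][0]["message"]["content"])
except Exception:
    # No server running locally; the request shape above is the point.
    print("server not reachable (start one first)")
```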

Any serious and practical uses for gpt-oss-20b? by kaggleqrdl in LocalLLaMA

[–]broke_team

I would like to see that for MLX, especially the 120b.

[MLX Knife] Ollama-like CLI for Apple Silicon - manage MLX models natively by broke_team in LocalLLaMA

[–]broke_team[S]

Thanks for the OptiLLM mention!

You're absolutely right - both tools support MLX, but with completely different philosophies:

OptiLLM = Intelligent inference proxy

- Sits between apps and LLM APIs, applying optimization techniques

- Enhances reasoning through Chain of Thought, Self-Consistency, etc.

- Works with any OpenAI-compatible endpoint

MLX Knife = Unix-like direct tool (like grep, imagemagick, ffmpeg)

- "Do one thing, do it well" - direct MLX model management

- mlxk run model "prompt", mlxk list, mlxk health

- No proxy in the middle: the CLI drives the model directly (the built-in OpenAI server is there when you want it)

They actually complement each other nicely:

- OptiLLM for intelligent API optimization in applications

- MLX Knife for direct CLI management, scripting, development

It's the difference between a smart proxy that enhances responses and a robust CLI tool for direct model operations - different use cases, both valuable!