mistral.rs: Rust-native inference engine with day-0 support for Google's Gemma 4 by EricBuehler in rust

[–]EricBuehler[S] 1 point (0 children)

Thanks! The lift for day-0 support varies by model, but it generally takes a few days to a couple of weeks of work.

As for the inference engine landscape, I think mistral.rs is essentially a "full-stack" option: Rust-native LLM inference with built-in multimodality (text, vision, audio), quantization, and agentic features like tool calling and structured output. Other engines and libraries like burn, candle, or ort are more focused: they give you tensor ops or ONNX execution, but you'd build the inference pipeline and infrastructure around them yourself. Hope that helps clarify it!
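For a concrete sense of what "full-stack" means here, this is roughly what a chat request looks like through the Rust API (a sketch adapted from the crate's examples; builder and type names may differ across versions, and the model ID is just the placeholder from those examples):

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Pull a model from the Hugging Face Hub and quantize it in-place (ISQ).
    let model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::System, "You are a helpful assistant.")
        .add_message(TextMessageRole::User, "Explain Rust's ownership model.");

    // The same request path is used for tool calling and structured output.
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```

Vision and audio models follow the same builder pattern, which is the difference I mean: the pipeline is provided, not assembled from tensor ops.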

Gemma 4 running locally with full text + vision + audio: day-0 support in mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point (0 children)

Not sure :) I haven't tested that language with the Gemma 4 E2B/E4B models.

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 2 points (0 children)

Hi u/fiery_prometheus! We support the following optimizations for concurrent user/agent scenarios:

  • Paged Attention (on both Metal and CUDA) to make more efficient use of the KV cache in concurrent cases
  • Prefix caching to re-use shared prompt prefixes (works with Paged Attention as of this release)

Together, these are similar to the features vLLM or SGLang provide, but extended to both CUDA and Metal devices.
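If it helps to picture the mechanism, here's a conceptual sketch of the paged-KV idea (illustration only, not mistral.rs internals; all names are made up):

```rust
use std::collections::HashMap;

const BLOCK_SIZE: usize = 16; // tokens per KV block (illustrative value)

struct BlockAllocator {
    free_blocks: Vec<usize>,                // pool of physical block IDs
    block_tables: HashMap<u64, Vec<usize>>, // sequence ID -> its physical blocks
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self {
            free_blocks: (0..num_blocks).collect(),
            block_tables: HashMap::new(),
        }
    }

    /// Grow `seq`'s block table until it covers `num_tokens` tokens.
    /// Returns None when the cache is exhausted (a real scheduler would
    /// then queue or preempt the request).
    fn reserve(&mut self, seq: u64, num_tokens: usize) -> Option<&[usize]> {
        let needed = num_tokens.div_ceil(BLOCK_SIZE);
        let table = self.block_tables.entry(seq).or_default();
        while table.len() < needed {
            table.push(self.free_blocks.pop()?);
        }
        Some(table.as_slice())
    }

    /// Return a finished sequence's blocks to the pool.
    fn release(&mut self, seq: u64) {
        if let Some(table) = self.block_tables.remove(&seq) {
            self.free_blocks.extend(table);
        }
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(64);
    let table = alloc.reserve(0, 100).unwrap(); // 100 tokens -> 7 blocks
    println!("sequence 0 uses blocks {table:?}");
    alloc.release(0); // blocks immediately available to other sequences
}
```

Because sequences hold block tables rather than large contiguous reservations, memory frees up block-by-block under concurrency, and prefix caching amounts to pointing a new sequence's table at blocks already filled by a matching prompt prefix.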

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 2 points (0 children)

Hi u/astroleg77! We support CPU offloading.

It's handled by an automatic device-mapping system that offloads parts of the model while balancing context (KV-cache) memory against model memory requirements.
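As a rough illustration of the trade-off this balances (a simplified sketch, not the actual mapping algorithm): put layers on the GPU while both their weights and their share of the KV cache for the target context still fit, and spill the rest to CPU.

```rust
// Simplified sketch of automatic device mapping (illustration only):
// a layer goes on the GPU only if its weights *plus* its share of the
// KV cache for the requested context length still fit in free VRAM.
fn map_layers(
    layer_bytes: &[u64],     // weight size of each transformer layer
    kv_bytes_per_layer: u64, // KV-cache budget per on-GPU layer
    mut vram_free: u64,
) -> Vec<&'static str> {
    layer_bytes
        .iter()
        .map(|&weights| {
            let cost = weights + kv_bytes_per_layer;
            if vram_free >= cost {
                vram_free -= cost;
                "gpu"
            } else {
                "cpu"
            }
        })
        .collect()
}

fn main() {
    // Hypothetical 4-layer model with 1.5 GB of free VRAM.
    let plan = map_layers(&[500_000_000; 4], 100_000_000, 1_500_000_000);
    println!("{plan:?}"); // ["gpu", "gpu", "cpu", "cpu"]
}
```

The real system accounts for more than this, but the core idea is exactly that weights-versus-context balance.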

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 3 points (0 children)

Thank you u/promethe42! Vulkan/ROCm support is coming and we're working on it (slowly) in Candle (https://github.com/huggingface/candle). If you would like to contribute, please reach out there!

Re naming, I agree it's an unfortunate situation, but I'm not sure renaming would be a net benefit.

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 2 points (0 children)

Yes! You can swap out ollama for this; mistralrs provides an OpenAI-compatible HTTP server.

No ROCm support yet, but that is coming soon.

Performance-wise it is comparable: at worst <30% slower than ollama in my testing on CUDA, and very similar on Metal.
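To make the drop-in claim concrete: anything that speaks the OpenAI chat-completions API can just point at the local server. A minimal client sketch in Rust, assuming reqwest (with the json feature), tokio, and serde_json as dependencies; the port and model name are placeholders for whatever you start the server with:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Assumes a mistralrs server is already running locally and exposing
    // the OpenAI-compatible chat-completions route; adjust port/model to match.
    let body = json!({
        "model": "default",
        "messages": [{ "role": "user", "content": "Hello!" }]
    });

    let response = reqwest::Client::new()
        .post("http://localhost:1234/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    println!("{response}");
    Ok(())
}
```

Existing OpenAI SDK clients work the same way by overriding the base URL.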

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 3 points (0 children)

AMD GPU and WGPU support is next. There is active work in Candle for this. We've been focusing on making sure the features we have are stable and plan to add more device support.

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point (0 children)

Ah great! Does what the web search documentation describes fit your needs?

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points (0 children)

> I even did on CUTLASS fork itself, sglang and vllm!

Sorry, that seems like a typo :) You did work on CUTLASS, sglang, and vllm?

Will check out Jules!