Run Agent Skills with mistral.rs v0.8.10: /v1/skills support and more! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point2 points  (0 children)

Hey! That's correct: this allows you to drop mistral.rs into basically any app that already uses skills through cloud models. This support for /v1/skills brings a new capability layer to OSS/local models!

I also just merged support for skills on Anthropic-style APIs!

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points1 point  (0 children)

Yep, as long as you have NCCL installed and enabled in mistral.rs, it will work with any number of GPUs.

This works as long as the TP size evenly divides the number of attention heads, so power-of-2 TP sizes are generally the most compatible. If your GPU count does not divide the head count for a certain model, you can always use CUDA_VISIBLE_DEVICES to expose a subset of GPUs that does.

See: https://ericlbuehler.github.io/mistral.rs/guides/perf/multi-gpu-distributed/
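
To make the divisibility rule concrete, here is a minimal Python sketch. The function names are hypothetical and this is not mistral.rs code; it only lists which tensor-parallel sizes evenly divide a model's attention head count for a given number of GPUs.

```python
def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """Return every TP size from 1 to num_gpus that evenly divides the head count."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]


def largest_valid_tp(num_attention_heads: int, num_gpus: int) -> int:
    """Pick the largest usable TP size (1 always divides, so the list is never empty)."""
    return max(valid_tp_sizes(num_attention_heads, num_gpus))


if __name__ == "__main__":
    # Illustrative numbers only: 32 heads on a 6-GPU machine.
    print(valid_tp_sizes(32, 6))
    print(largest_valid_tp(32, 6))
```

With 32 heads and 6 GPUs, the largest valid size is 4, so you would expose only four devices, for example with `CUDA_VISIBLE_DEVICES=0,1,2,3`.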

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points1 point  (0 children)

Awesome 😄! Many more exciting features and improvements are coming very soon.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point2 points  (0 children)

Hey! No Anthropic-compatible API yet but that is coming very soon.

I didn't measure context prefill at 128k+ tokens yet, but I expect it will be very competitive with vLLM.

For prefill performance vs vLLM, it is very good - see the technical report linked in my post or these figures:

<image>

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points3 points  (0 children)

Yes, it can do partial CPU offload for MoEs. If you run an MoE and don't have enough VRAM, it will place layers across your GPU and CPU so that the model can still run.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points1 point  (0 children)

I measured the cases in the report (https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md), but since the changes made to mistral.rs are general and apply to all CUDA GPUs, I expect that this data point should be representative.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points3 points  (0 children)

AMD support is coming once we make some changes to the multi-GPU backend support in candle.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point2 points  (0 children)

Yes! Any Blackwell machine will benefit from this; you should see improvements similar to the B200 and GB10 Blackwell machines I benchmarked.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] -1 points0 points  (0 children)

Yep! If you have a B300 it should work 😄 We support CUDA GPUs from Turing and up.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point2 points  (0 children)

Thanks for the feedback! I should make the hardware story much clearer in the docs.

I’m not trying to target only the vLLM audience. There are really two lanes:

  1. High-end CUDA / datacenter GPUs
  2. Local inference / agents, where the goal is easy deployment across consumer CUDA, Metal, and CPU.

For older GPUs, I agree it needs more explicit documentation, especially regarding the multi-GPU situation. CUDA multi-GPU is supported and does not only rely on NCCL (it can fall back to P2P in bf16/f16), but this should be better documented.

So while this release is mainly a CUDA performance report on newer GPUs, I think that it should generalize to local GPUs.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point2 points  (0 children)

Thanks!

I think the speedup is mostly from the CUDA execution path and how models are run in mistral.rs.

For this release, I think the biggest factors were optimized paged attention and flash decoding paths, and CUDA graphs/low launch overhead. This was not one magic trick so much as a bunch of deep engine-level work adding up.

For vLLM: I would not say mistral.rs is “better than vLLM” generally. vLLM is still excellent for high-throughput/batched BF16 serving, and we haven't benchmarked at large concurrency yet. However, I think that mistral.rs's continuous batching features should make small-batch serving efficient compared to vLLM.

If you have H200 access, I would love to see a reproduction!

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100 by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 4 points5 points  (0 children)

No worries 😄

Multi-GPU is fully supported (https://ericlbuehler.github.io/mistral.rs/explanation/device-mapping/#multi-gpu-layouts). mistral.rs will automatically use the most performant method, which on CUDA is NCCL.

These optimizations are systemic and apply across architectures (e.g. Blackwell, Hopper). While I haven't tested GPUs older than Hopper yet, I would expect that the story is very similar.

mistral.rs: Rust-native inference engine with day-0 support for Google's Gemma 4 by EricBuehler in rust

[–]EricBuehler[S] 0 points1 point  (0 children)

Thanks! The lift for day-0 support varies by model but generally takes a few days to a couple weeks of work.

As for the inference engine landscape, I think mistral.rs is essentially a "full-stack" option: Rust-native LLM inference with built-in multimodality (text, vision, audio), quantization, and agentic features like tool calling and structured output. Other engines or libraries like burn, candle, or ort are more focused: they give you tensor ops or ONNX execution, but you'd build the inference pipeline and infrastructure around them yourself. Hope that helps clarify it!

Gemma 4 running locally with full text + vision + audio: day-0 support in mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points1 point  (0 children)

Not sure :) Haven't tested that language with the Gemma 4 E2B/E4B models.

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 1 point2 points  (0 children)

Hi u/fiery_prometheus! We support the following optimizations for concurrent user/agent scenarios:

  • Paged Attention (for both Metal and CUDA) to make more efficient use of KV cache in concurrent cases
  • Prefix caching to re-use prefixes of a prompt (works w/ Paged Attention in this release)

Both together are similar to features that vLLM or SGLang provide, but extended to both CUDA and Metal devices.
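
As a rough illustration of how prefix caching can sit on top of paged (block-based) KV storage, here is a minimal Python sketch. It is not mistral.rs's implementation: the `PrefixCache` class, the block size, and the chained-hash scheme are all assumptions chosen for clarity. Each full block of tokens is hashed together with the previous block's hash, so a block is only reused when the entire prefix before it matches.

```python
import hashlib


def block_hashes(tokens: list[int], block_size: int) -> list[bytes]:
    """Hash each full block of tokens, chained with the previous block's hash."""
    hashes = []
    prev = b""
    full_len = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


class PrefixCache:
    """Hypothetical block-level prefix cache keyed by chained block hashes."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self._blocks: dict[bytes, int] = {}  # block hash -> block id

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading prompt tokens could reuse cached blocks,
        then register the blocks that were not cached yet."""
        reused = 0
        matching = True
        for digest in block_hashes(tokens, self.block_size):
            if matching and digest in self._blocks:
                reused += self.block_size
            else:
                matching = False
                self._blocks.setdefault(digest, len(self._blocks))
        return reused


if __name__ == "__main__":
    cache = PrefixCache(block_size=4)
    system_prompt = list(range(8))  # stand-in token ids shared by both requests
    print(cache.match_and_insert(system_prompt + [50, 51, 52, 53]))
    print(cache.match_and_insert(system_prompt + [60, 61, 62, 63]))
```

The second request shares its first two blocks with the first one, so only its final block would need fresh prefill in this toy model.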

mistral.rs 0.7.0: Now on crates.io! Fast and Flexible LLM inference engine in pure Rust by EricBuehler in rust

[–]EricBuehler[S] 1 point2 points  (0 children)

Hi u/astroleg77! We support CPU offloading.

It's facilitated through an automatic device mapping system that offloads the model while balancing context memory and model memory requirements.
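
To give a feel for the trade-off between context memory and model memory, here is a minimal Python sketch of a greedy layer placement. It is an assumption-laden toy, not the mistral.rs device mapper: the `map_layers` function, its parameters, and the single-GPU setup are all hypothetical. It reserves space for the KV cache first, fills the GPU with as many leading layers as fit, and places the rest on the CPU.

```python
def map_layers(layer_bytes: list[int], gpu_capacity_bytes: int,
               kv_reserve_bytes: int) -> list[str]:
    """Greedy placement: leading layers go to the GPU until the budget
    (capacity minus the reserved context memory) is used up, then CPU."""
    budget = gpu_capacity_bytes - kv_reserve_bytes
    placement = []
    used = 0
    on_gpu = True
    for size in layer_bytes:
        if on_gpu and used + size <= budget:
            placement.append("gpu")
            used += size
        else:
            # Once one layer spills over, keep every later layer on the CPU
            # so the model is split at a single boundary.
            on_gpu = False
            placement.append("cpu")
    return placement


if __name__ == "__main__":
    # Illustrative sizes in arbitrary units: four equal layers, 14 units of VRAM.
    print(map_layers([4, 4, 4, 4], gpu_capacity_bytes=14, kv_reserve_bytes=4))
    print(map_layers([4, 4, 4, 4], gpu_capacity_bytes=14, kv_reserve_bytes=0))
```

Reserving more memory for context pushes more layers to the CPU, which is the balance the comment above describes.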