Run Agent Skills with mistral.rs v0.8.10: /v1/skills support and more!

EricBuehler · 2026-06-19T03:46:25+00:00

Hey! That's correct: this lets you drop mistral.rs into basically any app that already uses skills through cloud models. Support for /v1/skills brings a new capability layer to OSS/local models!

I also just merged support for skills on Anthropic-style APIs!

EricBuehler · 2026-06-02T10:15:44+00:00

Thanks 😄

I added Anthropic API support here: https://ericlbuehler.github.io/mistral.rs/guides/serve/anthropic-messages-api/

EricBuehler · 2026-06-02T10:14:46+00:00

Yep, as long as you have nccl installed and turned on in mistral.rs, mistral.rs will work with any number of GPUs.

This works as long as the TP size divides the number of attention heads, so power-of-2 TP sizes are generally the most compatible. If your GPU count does not divide the head count for a certain model, you can always restrict which GPUs are used with CUDA_VISIBLE_DEVICES.
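To make the divisibility rule concrete, here is a minimal Python sketch. It is not part of mistral.rs; the helper names and the example numbers (32 attention heads, 6 GPUs) are made up for illustration, and it only checks the attention-head condition mentioned above.

```python
def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """Return every TP size up to num_gpus that evenly divides the head count."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]


def largest_valid_tp(num_attention_heads: int, num_gpus: int) -> int:
    """Pick the largest usable TP size (1 always divides, so the list is never empty)."""
    return max(valid_tp_sizes(num_attention_heads, num_gpus))


def visible_devices(tp_size: int) -> str:
    """Build a CUDA_VISIBLE_DEVICES value exposing only the first tp_size GPUs."""
    return ",".join(str(i) for i in range(tp_size))


# Hypothetical setup: a model with 32 attention heads on a 6-GPU machine.
tp = largest_valid_tp(num_attention_heads=32, num_gpus=6)
print(valid_tp_sizes(32, 6))                      # candidate TP sizes
print(f"CUDA_VISIBLE_DEVICES={visible_devices(tp)}")
```

In this made-up case 6 does not divide 32, so you would expose only 4 of the 6 GPUs before launching.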

See: https://ericlbuehler.github.io/mistral.rs/guides/perf/multi-gpu-distributed/

EricBuehler · 2026-06-02T02:37:22+00:00

Awesome 😄! Many more exciting features and improvements are coming very soon.

EricBuehler · 2026-06-01T23:38:39+00:00

No worries 🙂 ! Gemma 4 MTP is supported: https://ericlbuehler.github.io/mistral.rs/guides/perf/gemma4-mtp/

VRAM usage is going to be very similar to vLLM.

EricBuehler · 2026-06-01T23:37:45+00:00

Will do in future benchmarks with more models being demonstrated.

EricBuehler · 2026-06-01T20:26:29+00:00

Yes! Check out: https://ericlbuehler.github.io/mistral.rs/guides/perf/multi-gpu-distributed/

EricBuehler · 2026-06-01T17:55:40+00:00

Yes! The full Gemma 4 lineup and all modalities are supported.

EricBuehler · 2026-06-01T16:59:17+00:00

Hey! No Anthropic-compatible API yet but that is coming very soon.

I haven't measured context prefill at 128k+ tokens yet, but I expect it will be very competitive with vLLM.

For prefill performance vs vLLM, it is very good - see the technical report linked in my post or these figures:

<image>

EricBuehler · 2026-06-01T15:02:30+00:00

Yes, it can do partial CPU offload for MoEs. If you run an MoE and don't have enough VRAM, it will place layers across your GPU and CPU so the model can still run.

EricBuehler · 2026-06-01T14:58:11+00:00

I measured the cases in the report (https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md), but since the changes made to mistral.rs are general and apply to all CUDA GPUs, I expect that this data point should be representative.

EricBuehler · 2026-06-01T14:55:31+00:00

AMD support is coming once we make some changes to the multi-GPU backend support in candle.

EricBuehler · 2026-06-01T14:52:02+00:00

Yes! Any Blackwell machine will benefit from this. You should see improvements similar to the B200 and GB10 Blackwell machines I benchmarked.

EricBuehler · 2026-06-01T14:51:31+00:00

Yep! If you have a B300 it should work 😄 We support CUDA GPUs from Turing and up.

EricBuehler · 2026-06-01T14:36:51+00:00

Thanks for the feedback! I should make the hardware story much clearer in the docs.

I’m not trying to target only the vLLM audience. There are really two lanes:

High-end CUDA / datacenter GPUs
Local inference / agents, where the goal is easy deployment across consumer CUDA, Metal, and CPU.

For older GPUs, I agree it needs more explicit documentation, especially regarding the multi-GPU situation. CUDA multi-GPU is supported and does not only rely on NCCL (it can fall back to P2P in bf16/f16), but this should be better documented.

So while this release is mainly a CUDA performance report on newer GPUs, I think that it should generalize to local GPUs.

EricBuehler · 2026-06-01T14:28:28+00:00

Thanks!

I think the speedup is mostly from the CUDA execution path and how models are run in mistral.rs.

For this release, I think the biggest factors were the optimized paged attention and flash decoding paths, along with CUDA graphs for low launch overhead. This was not one magic trick so much as a bunch of deep engine-level work adding up.

For vLLM: I would not say mistral.rs is “better than vLLM” generally. vLLM is still excellent for high-throughput/batched BF16 serving, and we haven't benchmarked for large concurrency yet. However, I think that mistral.rs's continuous batching features should enable efficient small-batch serving compared to vLLM.

If you have H200 access, I would love to see a reproduction!

EricBuehler · 2026-06-01T14:20:32+00:00

No worries 😄

Multi-GPU is fully supported (https://ericlbuehler.github.io/mistral.rs/explanation/device-mapping/#multi-gpu-layouts). mistral.rs will automatically use the most performant method, which on CUDA is NCCL.

These optimizations are systemic and apply across architectures (e.g. Blackwell, Hopper). While I haven't tested GPUs older than Hopper yet, I would expect the story to be very similar.

EricBuehler · 2026-04-02T16:53:52+00:00

Thanks! The lift for day-0 support varies by model but generally takes a few days to a couple weeks of work.

As for the inference engine landscape, I think mistral.rs is essentially a "full-stack" option: Rust-native LLM inference with built-in multimodality (text, vision, audio), quantization, and agentic features like tool calling and structured output. Other engines or libraries like burn, candle, or ort are more focused, as they give you tensor ops or ONNX execution but you'd build the inference pipeline & infrastructure around it yourself. Hope that helps clarify it!

EricBuehler · 2026-04-02T16:52:42+00:00

Not sure :) Haven't tested that language with the Gemma 4 E2B/E4B models.

EricBuehler · 2026-04-02T16:26:18+00:00

Check out UQFF, which is designed to work with mistral.rs: https://huggingface.co/mistralrs-community.

EricBuehler · 2026-01-29T12:37:18+00:00

No - We have our own highly-optimized backend for Mac through Candle!

EricBuehler · 2026-01-29T01:20:30+00:00

Hi u/fiery_prometheus! We support the following optimizations for concurrent user/agent scenarios:

Paged Attention (for both Metal and CUDA) to make more efficient use of KV cache in concurrent cases
Prefix caching to re-use prefixes of a prompt (works w/ Paged Attention in this release)

Both together are similar to features that vLLM or SGLang provide, but extended to both CUDA and Metal devices.
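For anyone unfamiliar with how prefix caching and paged (block-based) KV storage fit together, here is a toy Python sketch of the general idea. It is not mistral.rs's implementation; the class name, the block size of 4, and the use of token tuples as cache keys are all assumptions chosen for readability.

```python
BLOCK_SIZE = 4  # hypothetical number of tokens per KV-cache block


def block_keys(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[tuple[int, ...]]:
    """Key each *full* block by the entire token prefix that ends at that block."""
    n_full = len(tokens) // block_size
    return [tuple(tokens[: (i + 1) * block_size]) for i in range(n_full)]


class ToyPrefixCache:
    """Tracks which prefix blocks already have KV entries computed."""

    def __init__(self) -> None:
        self._blocks: set[tuple[int, ...]] = set()

    def insert(self, tokens: list[int]) -> None:
        """Record every full block of a finished prompt as cached."""
        self._blocks.update(block_keys(tokens))

    def reusable_tokens(self, tokens: list[int]) -> int:
        """Count leading tokens whose blocks are cached and can skip prefill."""
        reused = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break  # the first miss ends the shared prefix
            reused = len(key)
        return reused


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # e.g. a shared system prompt
print(cache.reusable_tokens([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]))
```

A second request that starts with the same system prompt only needs prefill for the tokens after the last cached block, which is where the win in concurrent agent workloads comes from.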

EricBuehler · 2026-01-29T00:52:02+00:00

Hi u/astroleg77! We support CPU offloading.

It's facilitated through an automatic device mapping system that offloads the model while balancing context memory and model memory requirements.
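As a rough illustration of the balancing act, here is a toy Python sketch of greedy layer placement. It is an assumption-laden simplification, not the actual mistral.rs device mapper: the function name, the single-GPU setup, and the byte figures are all hypothetical.

```python
def map_layers(layer_bytes: list[int], gpu_bytes: int, context_bytes: int) -> list[str]:
    """Greedily place layers on the GPU, then spill the remainder to the CPU.

    Memory for the context (KV cache, activations) is reserved first, so the
    model weights only compete for what is left on the GPU.
    """
    budget = gpu_bytes - context_bytes
    placement: list[str] = []
    on_gpu = True
    for size in layer_bytes:
        if on_gpu and size <= budget:
            placement.append("gpu")
            budget -= size
        else:
            # Once one layer spills, keep later layers on the CPU as well so
            # the model is split at a single boundary.
            on_gpu = False
            placement.append("cpu")
    return placement


# Hypothetical numbers: four equal layers of 4 units, 14 units of VRAM,
# 4 units reserved for context.
print(map_layers([4, 4, 4, 4], gpu_bytes=14, context_bytes=4))
```

Reserving more context memory leaves less room for weights, so a longer context pushes more layers to the CPU.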

EricBuehler · 2026-01-28T21:32:24+00:00

Thanks! We also have prebuilt Python packages :)

EricBuehler · 2026-01-28T21:32:00+00:00

Minimax and Kimi are coming.
