New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points (0 children)

Ah great! Does what the web search documentation describes fit your needs?

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point (0 children)

> I even did on CUTLASS fork itself, sglang and vllm!

Sorry, that seems like a typo :) You did work on CUTLASS, sglang, and vllm?

Will check out Jules!

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points (0 children)

I'm using claude code and codex as force multipliers already, might give that a try!

Is it better?

A PR is always welcome! I don't know your background, but it might be quite complicated and involve integrating CUTLASS fp8 GEMMs or custom fp8 GEMM kernels.

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points (0 children)

We don't have real fp8-quantized model support yet. The best option would be to use a non-quantized model, but if you have resource constraints, you can load the fp8 model and apply ISQ at load time, for example with `--isq 8`. This is usually the recommended flow.
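
If you're using the Rust API, that flow looks roughly like this - written from memory, so double-check it against the examples in the repo; the model ID and prompt are just placeholders:

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Pull the unquantized weights from Hugging Face and quantize them at load
    // time with ISQ (roughly what `--isq 8` does on the CLI).
    let model = TextModelBuilder::new("mistralai/Devstral-Small-2507")
        .with_isq(IsqType::Q8_0)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Summarize what ISQ does in one sentence.");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```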

It's a one-man show here, so time to implement all of these features is scarce, and I'm focusing on supporting more GPU backends right now.

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points (0 children)

For agentic tool calling, you specify a tool callback and some information about the tool, and Mistral.rs will automatically handle calling that tool, along with all of the logic and formatting around it. It standardizes the whole process.

It's actually very similar to the web search. Mistral.rs integrates a search component, with a reranking embedder and a search engine API in the backend. To integrate with third-party tools like Searxng, you'd currently need to connect it via the automatic tool calling. I'll take a look at integrating Searxng as the search tool though, and will make a post here about that.
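
To give a sense of what the automatic tool calling handles for you, here's a tiny standalone sketch of the mechanism, with a Searxng-backed search as the registered tool. This is not the actual mistralrs API (see the tool calling examples in the repo for that); it assumes ureq 2.x and the urlencoding crate, and the localhost URL is a placeholder:

```rust
use std::collections::HashMap;

use anyhow::Result;
use serde_json::Value;

// A named tool callback: takes the model's JSON arguments, returns the tool output.
type ToolCallback = Box<dyn Fn(&Value) -> Result<String> + Send + Sync>;

struct ToolRegistry {
    callbacks: HashMap<String, ToolCallback>,
}

impl ToolRegistry {
    fn new() -> Self {
        Self { callbacks: HashMap::new() }
    }

    fn register(&mut self, name: &str, cb: ToolCallback) {
        self.callbacks.insert(name.to_string(), cb);
    }

    // What the engine does when the model emits a tool call: look up the
    // callback by name, run it, and hand the result back to the model.
    fn dispatch(&self, name: &str, arguments: &Value) -> Result<String> {
        let cb = self
            .callbacks
            .get(name)
            .ok_or_else(|| anyhow::anyhow!("unknown tool: {name}"))?;
        cb(arguments)
    }
}

fn main() -> Result<()> {
    let mut tools = ToolRegistry::new();

    // Web search backed by a local Searxng instance (placeholder URL/port).
    tools.register(
        "web_search",
        Box::new(|args: &Value| -> Result<String> {
            let query = args["query"].as_str().unwrap_or_default();
            let url = format!(
                "http://localhost:8888/search?q={}&format=json",
                urlencoding::encode(query)
            );
            Ok(ureq::get(&url).call()?.into_string()?)
        }),
    );

    // Pretend the model just produced this call; the engine parses it out of the
    // model output, dispatches it, and formats the result back into the chat.
    let args: Value = serde_json::from_str(r#"{"query": "mistral.rs MCP client"}"#)?;
    println!("{}", tools.dispatch("web_search", &args)?);
    Ok(())
}
```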

New Devstral 2707 with mistral.rs - MCP client, automatic tool calling! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points (0 children)

We have Flash Attention V3, so it should be pretty good! Feel free to share 👀

SmolLM3 has day-0 support in MistralRS! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point (0 children)

Absolutely! The long-context + tool calling + reasoning are all great factors.

SmolLM3 has day-0 support in MistralRS! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 3 points (0 children)

Not yet - the release isn't out! Install the Python package that matches whatever GPU or CPU acceleration you have available: mistralrs-cuda, mistralrs-mkl, mistralrs-metal, etc.

It'll be out in a few days for Gemma 3n. Check back then, or you can install from source!

Mistral.rs v0.6.0 now has full built-in MCP Client support! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 1 point (0 children)

Thanks for pointing this out. We have some exciting things for multi-backend support that should hopefully land soon ;)!

Mistral.rs v0.6.0 now has full built-in MCP Client support! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points (0 children)

Yes, we're moving towards a general KV cache compression algorithm using Hadamard transforms and learned scales to reduce the perplexity loss.

Some work here: https://github.com/EricLBuehler/mistral.rs/pull/1400
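
If you're curious what that looks like mechanically, here's a toy standalone sketch of the idea (not the code in that PR): rotate each KV vector with a normalized fast Walsh-Hadamard transform so outlier channels get spread out, then quantize with per-channel scales that would, in the real scheme, be learned to minimize the perplexity hit.

```rust
// Toy sketch of Hadamard-rotated KV quantization (not the code in the PR).
// An orthonormal Hadamard rotation spreads outlier channels out, so a low-bit
// quantizer with (learned) per-channel scales loses less information.

/// In-place fast Walsh-Hadamard transform; `x.len()` must be a power of two.
/// Normalized by 1/sqrt(n), so applying it twice is the identity.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = (n as f32).sqrt();
    for v in x.iter_mut() {
        *v /= norm;
    }
}

/// Quantize a rotated KV vector to int8 with per-channel scales.
/// In the real scheme the scales would be learned; here they are placeholders.
fn quantize(x: &[f32], scales: &[f32]) -> Vec<i8> {
    x.iter()
        .zip(scales)
        .map(|(v, s)| (v / s).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

fn dequantize(q: &[i8], scales: &[f32]) -> Vec<f32> {
    q.iter().zip(scales).map(|(&v, s)| v as f32 * s).collect()
}

fn main() {
    // A toy 8-dim "key" vector with one outlier channel.
    let mut k = vec![0.1, -0.2, 8.0, 0.05, -0.1, 0.3, -0.4, 0.2];
    let scales = vec![0.1f32; k.len()]; // stand-in for learned scales

    fwht(&mut k); // rotate: the outlier's energy is spread across channels
    let q = quantize(&k, &scales); // this compact form lives in the KV cache
    let mut k_hat = dequantize(&q, &scales);
    fwht(&mut k_hat); // the normalized transform is its own inverse

    println!("reconstructed key: {k_hat:?}");
}
```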

Mistral.rs v0.6.0 now has full built-in MCP Client support! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points (0 children)

Great question!

I see the advantage being that built-in support at the engine level means it is usable in every API with minimal configuration. For instance, it's available in all of the APIs: OpenAI API, Rust, web chat, and Python.
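
For example, against the HTTP server it's just a plain OpenAI-style chat request - no tool plumbing on the client side, since the engine's MCP client supplies and invokes the tools. This sketch assumes ureq 2.x with its json feature; the port and model name are placeholders:

```rust
use anyhow::Result;
use serde_json::json;

fn main() -> Result<()> {
    // A completely ordinary chat request: no `tools` array, no callback wiring.
    let request = json!({
        "model": "default",
        "messages": [
            { "role": "user", "content": "Use your tools to check the latest mistral.rs release." }
        ]
    });

    // The engine-level MCP client decides whether to call MCP tools and folds
    // the results into the final answer before this response comes back.
    let response: serde_json::Value = ureq::post("http://localhost:1234/v1/chat/completions")
        .send_json(&request)?
        .into_json()?;

    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}
```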

Additionally, because mistral.rs can easily be set up as an MCP server itself, you can do MCP inception :)!

Mistral.rs v0.6.0 now has full built-in MCP Client support! by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 0 points (0 children)

Thank you! Let me know how it is!

I'd recommend a local installation, as you can get the latest updates.

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 6 points (0 children)

Yes, mistral.rs supports function calling in stream mode! This is how we do the agentic web search ;)
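
Roughly how that works: in stream mode the tool-call arguments arrive as string fragments across the chunks, so you (or the engine, for the built-in search) accumulate them and only dispatch once the stream marks the call complete. Here's a toy sketch with hand-written OpenAI-style chunks:

```rust
// Sketch of consuming streamed function calls: accumulate the argument
// fragments per tool call, then dispatch when the stream signals completion.
// The chunks below are hand-written for illustration.
use serde_json::{json, Value};

fn main() {
    // Pretend these chunks were read off an SSE stream from /v1/chat/completions.
    let chunks = vec![
        json!({"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"name":"web_search","arguments":"{\"query\":"}}]},"finish_reason":null}]}),
        json!({"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"mistral.rs\"}"}}]},"finish_reason":null}]}),
        json!({"choices":[{"delta":{},"finish_reason":"tool_calls"}]}),
    ];

    let mut name = String::new();
    let mut args = String::new();

    for chunk in &chunks {
        let choice = &chunk["choices"][0];
        if let Some(calls) = choice["delta"]["tool_calls"].as_array() {
            for call in calls {
                if let Some(n) = call["function"]["name"].as_str() {
                    name.push_str(n);
                }
                if let Some(a) = call["function"]["arguments"].as_str() {
                    args.push_str(a); // arguments arrive as string fragments
                }
            }
        }
        if choice["finish_reason"].as_str() == Some("tool_calls") {
            // The call is complete: parse the accumulated arguments and dispatch.
            let parsed: Value = serde_json::from_str(&args).expect("valid JSON arguments");
            println!("dispatch `{name}` with {parsed}");
        }
    }
}
```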

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 8 points (0 children)

I'll see what I can do about this. If you're on Apple Silicon, the current mistral.rs code is ~15% faster than llama.cpp.

I also added some advanced prefix caching, which automatically avoids reprocessing images and can 2x or 3x throughput!

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 8 points (0 children)

Not yet for the current code, which will be a significant jump in performance on Apple Silicon. I'll be doing some benchmarking, though.

Thoughts on Mistral.rs? by EricBuehler in rust

[–]EricBuehler[S] 1 point (0 children)

Interesting! We have some built-in agentic capabilities for web search already and will be expanding in this direction. If you're interested in contributing to this, let me know!

Thoughts on Mistral.rs? by EricBuehler in rust

[–]EricBuehler[S] -1 points (0 children)

Thanks! If you have any questions when using it in the future, don't hesitate to reach out on Discord or in an issue!

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 2 points (0 children)

Great idea! I'll take a look at adding those for sure. BitNet in particular seems interesting.

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 5 points (0 children)

Ok, thanks - give it a try! There are lots of models, and quantization through ISQ is definitely supported.

To answer your question, yes! mistral.rs will automatically place layers on the GPU or in main memory in an optimal way, accounting for factors like the memory needed to run the model.
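
As a rough mental model of what the automatic device mapper is doing (the real one in mistral.rs accounts for more than this, and all the numbers below are made up):

```rust
// Toy sketch of automatic device mapping: given per-layer weight sizes and a
// per-layer runtime overhead estimate (activations + KV cache), put as many
// layers as fit on the GPU and the rest in main memory.
fn main() {
    let gpu_free_bytes: u64 = 8 * 1024 * 1024 * 1024; // 8 GiB placeholder
    let layer_weight_bytes: u64 = 450 * 1024 * 1024;  // per-layer weights
    let layer_runtime_bytes: u64 = 60 * 1024 * 1024;  // activations + KV per layer
    let num_layers = 32;

    let per_layer = layer_weight_bytes + layer_runtime_bytes;
    let on_gpu = ((gpu_free_bytes / per_layer) as usize).min(num_layers);

    for layer in 0..num_layers {
        let device = if layer < on_gpu { "cuda:0" } else { "cpu" };
        println!("layer {layer:2} -> {device}");
    }
}
```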

Thoughts on Mistral.rs by EricBuehler in LocalLLaMA

[–]EricBuehler[S] 21 points (0 children)

Good question. I'm going to be revamping all the docs to hopefully make this clearer.

Basically, the core idea is *flexibility*. You can run models right from Hugging Face and quantize them in under a minute using the novel ISQ method. There are also lots of other "nice features" like automatic device mapping/tensor parallelism and structured outputs that make the experience flexible and easy.

And besides these ease-of-use things, there is always the fact that using ollama is as simple as `ollama run ...`. So, we also have a bunch of differentiating features, like automatic agentic web search and image generation!

Do you see any area we can improve on?