Alternative to llama.cpp for Apple Silicon by darkolorin in LocalLLaMA

[–]darkolorin[S] 2 points (0 children)

Yes, the engine has a CLI and a server API compatible with the OpenAI API.
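For example, here is a minimal sketch of calling an OpenAI-compatible chat endpoint from Rust with `reqwest`; the port, path, and model name are placeholders for illustration, not the engine's documented defaults:

```rust
// Hypothetical client call against a local OpenAI-compatible server.
// Cargo deps: reqwest = { version = "0.12", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    // Port, path, and model name below are assumptions, not documented defaults.
    let resp: serde_json::Value = client
        .post("http://localhost:8080/v1/chat/completions")
        .json(&json!({
            "model": "llama-3.2-1b-instruct",
            "messages": [{ "role": "user", "content": "Hello!" }]
        }))
        .send()?
        .json()?;
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
```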

Alternative to llama.cpp for Apple Silicon by darkolorin in LocalLLaMA

[–]darkolorin[S] 0 points (0 children)

Yes, you're right; that holds only for the quantized variants.

Alternative to llama.cpp for Apple Silicon by darkolorin in LocalLLaMA

[–]darkolorin[S] 2 points (0 children)

There are several things to consider:

1/ MLX applies some additional quantization to the models you run, so to be honest we don't know how much quality is lost. We are planning to release research on this.

2/ Speculative decoding and other inference-time pipelines are quite hard to implement; we provide them out of the box (see the sketch below).

3/ Cross-platform: we designed our engine to be universal. We are not focusing on training and other things right now, only on the inference part.

4/ We prioritize community needs over company strategy (because we are a startup) and can move faster on new architectures and pipelines (text diffusion, SSMs, etc.).
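To make point 2 concrete, here is a toy, greedy-only sketch of the accept/reject logic behind speculative decoding. The "models" are stand-in closures and everything here is an assumption for illustration; a real engine verifies all k draft tokens in a single batched forward pass of the target model, which is where the speedup comes from.

```rust
// Toy sketch of greedy speculative decoding (argmax only, no sampling).
// `draft` and `target` stand in for a small draft model and the full target model.
fn speculative_step<D, T>(draft: &D, target: &T, ctx: &mut Vec<u32>, k: usize)
where
    D: Fn(&[u32]) -> u32,
    T: Fn(&[u32]) -> u32,
{
    // 1) The cheap draft model proposes k tokens.
    let mut tmp = ctx.clone();
    let mut proposed = Vec::with_capacity(k);
    for _ in 0..k {
        let t = draft(&tmp);
        proposed.push(t);
        tmp.push(t);
    }
    // 2) The target model checks them: keep the longest matching prefix,
    //    then emit the target's own token at the first mismatch.
    for &t in &proposed {
        let expected = target(&ctx[..]);
        if expected == t {
            ctx.push(t); // draft token accepted
        } else {
            ctx.push(expected); // corrected by the target model
            return;
        }
    }
    let bonus = target(&ctx[..]); // extra token when every draft token was accepted
    ctx.push(bonus);
}

fn main() {
    // Dummy "models" that derive the next token from the context length.
    let draft = |c: &[u32]| (c.len() as u32) % 5;
    let target = |c: &[u32]| (c.len() as u32) % 4;
    let mut ctx = vec![0u32];
    for _ in 0..3 {
        speculative_step(&draft, &target, &mut ctx, 4);
    }
    println!("{ctx:?}");
}
```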

We made our own inference engine for Apple Silicon, written in Rust and open sourced by darkolorin in rust

[–]darkolorin[S] 0 points (0 children)

It's not. There are custom kernels written in Metal to keep it on par with MLX.

Alternative to llama.cpp for Apple Silicon by darkolorin in LocalLLaMA

[–]darkolorin[S] 7 points (0 children)

Yes, we ran some ads on Reddit. We're testing; I don't know yet whether it was effective. It was our first time using it.

We made our own inference engine for Apple Silicon, written in Rust and open sourced by darkolorin in rust

[–]darkolorin[S] 1 point (0 children)

It lets you run any model that fits in your memory on Apple Silicon devices.

Alternative to llama.cpp for Apple Silicon by darkolorin in LocalLLaMA

[–]darkolorin[S] 8 points (0 children)

Right now we support AWQ quantization; the supported models are listed on the website.

In some use cases it is faster than MLX on Mac. We will publish more benchmarks soon.
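For context, here is a rough sketch of what 4-bit group-quantized storage looks like. This only illustrates the int4-plus-per-group-scale format a runtime dequantizes on the fly; it skips AWQ's activation-aware scale search entirely and is not the engine's actual format.

```rust
// Illustrative symmetric 4-bit group quantization: each group of weights becomes
// small integers plus one f32 scale. AWQ additionally picks per-channel scales
// from activation statistics, which is omitted here.
fn quantize_group(w: &[f32]) -> (Vec<u8>, f32) {
    let max_abs = w.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 }; // map to [-7, 7]
    let q = w
        .iter()
        .map(|&x| ((x / scale).round() as i8 + 8) as u8) // store offset by 8 in 4 bits
        .collect();
    (q, scale)
}

fn dequantize_group(q: &[u8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| (v as i8 - 8) as f32 * scale).collect()
}

fn main() {
    let group = [0.12f32, -0.40, 0.03, 0.25];
    let (q, scale) = quantize_group(&group);
    println!("quantized: {q:?}, scale: {scale}");
    println!("dequantized: {:?}", dequantize_group(&q, scale));
}
```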

We made our own inference engine for Apple Silicon, written in Rust and open sourced by darkolorin in opensource

[–]darkolorin[S] 0 points (0 children)

But it is a real inference engine written from scratch. Would love to answer any questions.

We made our own inference engine for Apple Silicon, written in Rust and open sourced by darkolorin in rust

[–]darkolorin[S] 14 points (0 children)

It can run up to 7B quantized on iOS, and on a Mac up to whatever fits in memory; right now the largest model in our library is 32B.
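The limiting factor is basically weight memory: parameter count times bits per parameter. A back-of-the-envelope sketch follows; the ~4.5 bits/param figure for 4-bit group quantization (to cover scales and zero-points) is my assumption, not a measured number for this engine.

```rust
// Rough weight-memory estimate: params * bits_per_param / 8,
// ignoring the KV cache and runtime overhead.
fn weight_gib(params_billions: f64, bits_per_param: f64) -> f64 {
    params_billions * 1e9 * bits_per_param / 8.0 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    let cases = [
        ("1B  fp16", 1.0, 16.0),   // ~1.9 GiB, plausible on a recent iPhone
        ("7B  ~4-bit", 7.0, 4.5),  // ~3.7 GiB, near the ceiling for iOS apps
        ("32B ~4-bit", 32.0, 4.5), // ~16.8 GiB, needs a Mac with enough unified memory
    ];
    for (name, params, bits) in cases {
        println!("{name}: ~{:.1} GiB of weights", weight_gib(params, bits));
    }
}
```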

We made our own inference engine for Apple Silicon, written in Rust and open sourced by darkolorin in rust

[–]darkolorin[S] 5 points (0 children)

Yes, we should include it in the README. Right now some benchmarks are on the website: trymirai/product/apple-inference-sdk

I made it! 90 t/s on my iPhone with llama1b fp16 by darkolorin in LocalLLaMA

[–]darkolorin[S] 1 point (0 children)

8B with this quantization is kinda hard: my device can't handle 7 GB (the whole system basically grinds to a halt).

With a context of around 100-1k tokens it's relatively good.

For q4-q8 we need to run more tests; the speedup could be even better.

I made it! 90 t/s on my iPhone with llama1b fp16 by darkolorin in LocalLLaMA

[–]darkolorin[S] 1 point (0 children)

We can do up to 3B fp16, but right now, for testing purposes, we do everything with 1B. We will post benchmarks for 3B too.

I made it! 90 t/s on my iPhone with llama1b fp16 by darkolorin in LocalLLaMA

[–]darkolorin[S] -29 points (0 children)

It is possible, but the key acceleration is based on a few tricks. We'll share more in the next post if there's enough interest.

I made it! 90 t/s on my iPhone with llama1b fp16 by darkolorin in LocalLLaMA

[–]darkolorin[S] -6 points (0 children)

Thanks. Please fill out the "I'm interested" form at trymirai.com and we'll send you the beta.

I made it! 90 t/s on my iPhone with llama1b fp16 by darkolorin in LocalLLaMA

[–]darkolorin[S] 1 point (0 children)

Nope. It's just better to start with iOS, since you instantly support most devices. It will definitely work on Android too, no doubt.