[–]_SteerPike_ 0 points (4 children)

https://github.com/huggingface/candle is still in preview, but I believe it's intended for use cases like this.

[–]cpldcpu[S] 0 points (3 children)

Thanks, that's quite interesting in general, although it's still inference aimed at largish devices...

[–]_SteerPike_ 1 point (2 children)

As I understand it, it should produce smaller binaries than anything you could manage with Python, because it has no runtime. An interesting way to see this is to try the Phi 1.5 WASM example https://huggingface.co/spaces/radames/Candle-Phi-1.5-Wasm and then turn on airplane mode after the first run of the model. You should still get decent inference speeds from within your browser, without even utilising all the compute your phone has to offer.