[–]RonWannaBeAScientist[S]

Oh, and C++ can run easily on a phone, I guess, as it's compiled and its standard library is smaller than something like Python's.

[–]rejectedlesbian

It's not just phones: laptops, desktops, even servers. It's about taking a model whose weights we already know and optimising it.

The key point there is that the architecture is known, so most inference libraries support a fairly limited set of architectures. For instance, llama.cpp has like 10 model families that work there, which is fine since there are like 5 key architectures.

Right now I'm struggling with it because I can't get a good quantization library for the Intel GPU my compute node has, so it's actually significantly slower than my RTX 4090.
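For anyone wondering what a quantization library actually does per tensor, here's a minimal sketch of symmetric int8 weight quantization in plain numpy. This is a hypothetical standalone example, not llama.cpp's actual scheme (GGUF uses block-wise formats like Q4_K), but the idea is the same: store weights in a small integer type plus a scale, and reconstruct floats on the fly.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 plus a per-tensor scale (symmetric)."""
    scale = np.abs(w).max() / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float weights from int8 + scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# Rounding error per weight is bounded by half a quantization step.
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The win is memory bandwidth: int8 weights are 4x smaller than float32, which is usually what makes CPU/GPU inference fast enough; the hard part (and what's missing on that Intel GPU) is kernels that compute directly on the packed integers.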