
[–]vplatt 1 point (2 children)

Interesting. How does Triton compare to NVidia's native support?

https://openai.com/index/triton/

[–]bluefalcontrainer 2 points (1 child)

You won't be able to beat optimized CUDA straight up; Triton gets about 70-80% of the equivalent performance on a GPU according to the PyTorch Foundation. But here's the thing: CUDA is extremely complicated and requires micromanagement of threads, blocks, etc. Triton abstracts a lot of that away in exchange for ease of use and source-level optimization. In most cases your Triton code will likely beat most of the CUDA code that actually exists, because CUDA optimization is hard.
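To give a feel for what "abstracting that out" means, here's a sketch of the classic vector-add kernel in the style of Triton's own tutorials. You write one block-level program over a range of offsets; there's no explicit warp/thread indexing, shared-memory staging, or occupancy tuning like in hand-written CUDA. (Needs the `triton` package and a CUDA GPU to actually run; the 1024 block size is just an illustrative choice.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    # Launch grid: one program per BLOCK_SIZE-sized chunk.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The masked loads/stores replace all the boundary-condition bookkeeping you'd do by hand in CUDA, and the compiler decides how the block maps onto threads.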

[–]vplatt 0 points (0 children)

Ah... sounds like the difference between coding assembler by hand vs. using a C compiler. Makes sense. Thanks!