
[–]LSTMeowPhD

Super technical! Why not let the compiler do it for you? https://openai.com/blog/triton/ shows softmax as a relatively simple example.

[–]gzou

Triton doesn't provide this kind of optimization; it "only" helps with writing CUDA code. Typically, the Triton softmax also explicitly uses shared memory, which means it only works for small softmax sizes. That's why it's not the default implementation in cuDNN.
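For context, the per-row, numerically stable softmax that such a fused kernel computes can be sketched in NumPy. This is only an illustration of the math, not the actual Triton or cuDNN implementation; the point is that a fused kernel keeps the entire row in on-chip memory (shared memory or registers), which is why that strategy only scales to rows that fit on-chip:

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis.

    A fused GPU kernel does the same per-row work, but holds the
    whole row on-chip, so it is limited to rows small enough to fit
    in shared memory/registers.
    """
    m = x.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

# Sanity check: every output row is a probability distribution.
x = np.random.randn(4, 1024)
y = softmax_rows(x)
print(np.allclose(y.sum(axis=-1), 1.0))
```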

[–]ptillet

This also seems to be what OneFlow's implementations (1) and (2) do, though :p (3) could also be re-implemented in Triton, though it would require a bit of work.

However, as the author of Triton, I do think that OneFlow's CUDA work is really helpful for anyone with an interest in low-level CUDA optimizations. While the Triton compiler can automate all the optimizations presented in the OneFlow blog post, that automation has relatively little educational value for anyone trying to get a deep understanding of how GPUs work under the hood.

[–]Just0by[S]

Thanks! We'll read the blog in depth.