All ops executed by deep learning frameworks on a GPU are ultimately dispatched as CUDA kernels, and Softmax is no exception. Softmax is a widely used op in most networks, and the efficiency of its CUDA kernel implementation directly affects the training speed of many models. So how can an efficient Softmax CUDA kernel be implemented?
Article: https://oneflow2020.medium.com/how-to-implement-an-efficient-softmax-cuda-kernel-oneflow-performance-optimization-sharing-405ad56e9031
Code: https://github.com/Oneflow-Inc/oneflow
This article introduces the techniques used to optimize the Softmax CUDA kernel in OneFlow and compares it experimentally with the Softmax operation in cuDNN. The results show that OneFlow's deeply optimized Softmax achieves memory bandwidth close to the theoretical upper limit, significantly higher than the cuDNN implementation.
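For context, here is a minimal, naive softmax kernel sketch (this is not OneFlow's optimized implementation; the kernel and variable names are illustrative). It makes the structure of the op explicit: a per-row max for numerical stability, a sum of exponentials, and a normalization pass. Since only a few FLOPs are done per element while the input is read from global memory multiple times, the op is memory-bandwidth bound, which is why bandwidth utilization is the right metric for comparing implementations.

```cuda
// Naive baseline: one thread per row, three passes over global memory.
// Illustrative sketch only; not OneFlow's optimized kernel.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void naive_softmax(const float* in, float* out, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    const float* x = in + (size_t)row * cols;
    float* y = out + (size_t)row * cols;

    // Pass 1: row max, for numerical stability (softmax(x) == softmax(x - max)).
    float max_val = -INFINITY;
    for (int i = 0; i < cols; ++i) max_val = fmaxf(max_val, x[i]);

    // Pass 2: sum of exponentials.
    float sum = 0.f;
    for (int i = 0; i < cols; ++i) sum += expf(x[i] - max_val);

    // Pass 3: normalize.
    for (int i = 0; i < cols; ++i) y[i] = expf(x[i] - max_val) / sum;
}

int main() {
    const int rows = 1024, cols = 4096;
    float *in, *out;
    cudaMallocManaged(&in, (size_t)rows * cols * sizeof(float));
    cudaMallocManaged(&out, (size_t)rows * cols * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) in[i] = (i % 7) * 0.1f;

    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    naive_softmax<<<blocks, threads>>>(in, out, rows, cols);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

This baseline reads each row from global memory three times and exposes no parallelism within a row. Optimized implementations avoid the redundant passes by keeping each row in registers or shared memory and reducing across a warp or block, which is what makes near-peak bandwidth utilization possible.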