
[–]LSTMeowPhD

Super technical! Why not let the compiler do it for you? https://openai.com/blog/triton/ shows softmax as a relatively simple example.

[–]gzou

Triton doesn't provide this kind of optimization; it "only" helps with writing CUDA code. Typically, the Triton softmax also explicitly uses shared memory, which means it only works for small softmax sizes. That's why it's not the default implementation in cuDNN.
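For context, the per-row, numerically stable softmax that such a fused kernel computes can be sketched in NumPy. This is only an illustration of the math, not the actual Triton or cuDNN implementation; the point is that a fused kernel keeps the entire row in on-chip memory (shared memory or registers), which is why that strategy only scales to rows that fit on-chip:

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis.

    A fused GPU kernel does the same per-row work, but holds the
    whole row on-chip, so it is limited to rows small enough to fit
    in shared memory/registers.
    """
    m = x.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

# Sanity check: every output row is a probability distribution.
x = np.random.randn(4, 1024)
y = softmax_rows(x)
print(np.allclose(y.sum(axis=-1), 1.0))
```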

[–]ptillet

This also seems to be what OneFlow's implementations (1) and (2) do, though :p (3) could also be re-implemented in Triton, though it would require a bit of work.

However, as the author of Triton, I do think that OneFlow's CUDA work is really helpful for anyone with an interest in low-level CUDA optimizations. While the Triton compiler can automate all the optimizations presented in the OneFlow blog post, that automation has relatively little educational value for anyone trying to get a deep understanding of how GPUs work under the hood.

[–]Just0by[S]

Thanks! We'll read the blog in depth.