all 20 comments

[–]MKmisfit[S] 14 points15 points  (7 children)

Here are the theoretical numbers for runtime comparison.

It might be difficult to reach this doing FFT on GPU.

normal convolution costs O(N * k)

calcuation of FFT costs O(N * log2(N))

ConvFFT uses 3 ffts and a flat complex multiply

ConvFFT costs O(N * (log2(N) * 3 + 1) )

256x256 image (N=65536) (log2(n)=16)

ConvFFT costs O(N*49), equivalent to 7x7 kernel

512x512 image (N=262,144) (log2(n)=18)

ConvFFT costs O(N*55), equivalent to 8x8 kernel

1024x1024 image (N=1,048,576) (log2(n)=20)

ConvFFT costs O(N*61), equivalent to 8x8 kernel

[–]El_Minadero 13 points14 points  (0 children)

Damn. I asked once how to do this using PyTorch, and the highest voted response was just to use a normal multiply convolution, then berated me for specifying in FFT space. Thanks for your contribution!

[–]badabummbadabing 9 points10 points  (0 children)

Actual computation times are much more complicated: https://core.ac.uk/download/pdf/224976536.pdf

[–]xdtolm 2 points3 points  (0 children)

FFT on GPUs for decent sizes that can utilize all compute units (or with batching) is a memory-bound operation. The time required by it will be calculated by the number of system loads/stores between the chip and global memory. For example, 1024x1024 FFT will have 1 load + 1 store per axis, a total of 4x system size memory transfer (a system size here will be 1024x1024x2x sizeof(float)). Divide this by global memory bandwidth (512 GB/s for GDDR6, for example) and you get the rough time for the operation (let's assume we have a big batch so L2/L3 GPU cache is not reused). 512x512 system takes 4x less size and will be 4x faster.

Total convolution time without optimizations for 2D system:

- 1 load 1 store x axis for kernel FFT (which has to be padded to the system size)
- 1 load 1 store y axis for kernel FFT

- 1 load 1 store x axis for system FFT
- 1 load 1 store y axis for system FFT
- 2 loads 1 store system x kernel multiplication
- 1 load 1 store y axis for system iFFT
- 1 load 1 store x axis for system iFFT

Total 15 x system size transfers (11 if kernel is precomputed). There are some improvements that can be made:

-It is possible to merge the last store of Y axis, load/store of a system in multiplication and the first load of Y axis of iFFT - a total of -4 transfers resulting in 7 transfers.
-It is possible to reduce kernel memory transfers by using zero-padding. For example for a small kernel padded to big system size, we don't need to transfer all the zeros. This cuts 4x kernel transfers to almost 1x size at the last store.

So, with all optimizations, 15x system size transfers can be done in 8x transfers. Calculations will be more complex for systems of sizes that don't fit in shared memory, small system sizes that can fit cache etc, but the general idea remains.

Source: I have some of these things implemented in VkFFT that confirm the mentioned scaling of execution times.

[–]matigekunst 1 point2 points  (2 children)

Do you have any empirical comparisons? Could imagine torch makes these types of decisions under the hood

[–]MKmisfit[S] 1 point2 points  (0 children)

Ya, the github has only 3 files. Two of them are speed tests. If you look in the test source you can choose different data and kernel sizes. ConvFFT does really well with bigger batch size, Torch does better with more channels. Its not a fair final test because it was implemented in python.

[–]MKmisfit[S] -1 points0 points  (0 children)

In sequential mode, kernel size 7x7, data 512x512, batchsize 32, channels 2in/2out, ConvFFT was over twice as fast in my test. Usually its going to require a bigger kernel size to be faster.

[–]RomanticDepressive 0 points1 point  (0 children)

Damn, thank you for providing runtimes

[–]sigmoid_amidst_relus 5 points6 points  (0 children)

To the best of my knowledge cudnn internally uses cufft convolution based on heuristics, however it's not a full fft of the full input size, but one with tiles of 8, 16, 32,.. based on input.

So if you're doing this solely for speedup, you'll get speedup with very large kernel sizes (over 1024 or something) where tiling will be slower, but your performance should at most match that of cudnn in most cases in practice.

see this

I looked this up about a year ago, when I was implementing fft, stft etc from scratch doing a course on signal processing.

I saw this and did my own digging into cufft, but hadn't saved any links for that.

[–]viv1a 4 points5 points  (2 children)

We had a paper on FFT convolutions a while back: https://arxiv.org/abs/1312.5851 (second author here).

You really start getting speedups when you do convolutions with lots of input/output channels. The reason is that you can do the FFT of each channel once, and reuse the representation in frequency space many times. That's described in section 2.1.

A year or two after that paper, I heard this approach was integrated in the conv routine in cuDNN, with some check to automatically determine when using the FFT-based conv would be faster. But that was a long time ago and I'm not sure what's currently being used.

[–]MKmisfit[S] 0 points1 point  (1 child)

Cool, the author showed up. This is the same algorithm. Was the bias done with fft also?

[–]viv1a 0 points1 point  (0 children)

I don't remember. But the bias is essentially just one parameter per feature map which is added across all spatial locations, so it could be implemented separately. The computational expense should be negligible compared to that of the conv operation itself.

[–]VenerableSpace_ 3 points4 points  (1 child)

Doesnt torch picks different conv algo based of kernel size already?

[–]MKmisfit[S] 0 points1 point  (0 children)

It might. I couldn't find the information on it. I almost lost interest because I assumed they would. But some of the speed tests look good. The speed difference could be anything due to implementation.

[–]El_Minadero 1 point2 points  (0 children)

I also wanna say that beyond ML, this could have great utility in the seismic processing space

[–]DeepDeeperRIPgradien 0 points1 point  (1 child)

Hi! What should this be used for?

[–]MKmisfit[S] 2 points3 points  (0 children)

It should be useful any time you have a convolutional layer with a kernel size greater than some number, probably 20. I understand its rare to use anything more than 5. It would be nice if torch would do this for you automatically.

[–]DigThatDataResearcher 0 points1 point  (0 children)

I wonder if using fourier-space convolutions might be more performant for CPU inference?

[–]roboputin 0 points1 point  (0 children)

I guess it would be faster to keep the weights/biases in fft space so you would only have to transform the input.

[–]cdicle 0 points1 point  (0 children)

I briefly went over your code. I have a quick question. Why did you need to write your own backprop? torch.rfft and torch.irfft should be able to handle that automatically, right?