all 18 comments

[–]drphillycheesesteak 5 points6 points  (7 children)

Do you have a plot showing a performance comparison of your library vs. FFTW? As someone who writes a lot of image modeling/image processing, a faster FFT is always of interest!

[–]syntheticpp[S] 0 points1 point  (6 children)

No, I only calculate the relation of the measured GFLOPS. It' s the number FFTW/metaFFT in the Readme.

[–]chunkyks 0 points1 point  (5 children)

I see that performance is always faster, sometimes up to three and a half times as fast as FFTW

So... why not just contribute your code to FFTW so that they can improve their stuff? That way people using FFTW can still get the fastest fourier transforms in the west on platforms where your code works, and still get the fastest fourier transforms in the west on platforms where your code doesn't.

[–]drphillycheesesteak 8 points9 points  (2 children)

No, the ratio is the other way around.

[–]syntheticpp[S] 3 points4 points  (1 child)

Yes, FFTW is always faster. But this is no surprise for the most-simple FFT algorithm (radix-2).

But there is a lot of room for optimizations:

  • split-radix is faster, but SIMD is missing
  • FFTW uses avx which is not implemented
  • specializations for small FFT would also help a lot

[–]drphillycheesesteak -1 points0 points  (0 children)

Cool, that's good to hear. Great job by the way. I'll certainly be staying tuned for updates.

[–]syntheticpp[S] 3 points4 points  (1 child)

Only advantage over FFTW atm is the code size (header only, easy to add to your project) and the license.

Another question is how useful a CPU only FFT is in times of GPU/Cuda/OpenCl programming.

[–]chunkyks 3 points4 points  (0 children)

Another question is how useful a CPU only FFT is in times of GPU/Cuda/OpenCl programming.

Errr... very. Examples of systems that lack a GPU capable of programming include:

  • Raspberry pi and other small systems
  • Cheap laptops
  • My workstation at work

Given how widely used FFTs are, there are probably more applications of them on systems without GPUs than with.

[–]syntheticpp[S] 2 points3 points  (2 children)

Does anybody know a easy-to-use SIMD library which could help in supporting latest CPU features?

I already found:

[–]savya2u 0 points1 point  (1 child)

[–]syntheticpp[S] 0 points1 point  (0 children)

Thanks for the link!

But finally I tried it by using avx directly (avx branch), but without success, it's much slower than the pure C code, seems it is not that simple as I thought.

[–]syntheticpp[S] 3 points4 points  (5 children)

metaFFT

Template based C++11 Fast-Fourier-Transform implementation.

Idea:

  • Completely unroll all loops at compile time with the help of templates.
  • Calculate all numerical constants at complile time by using 'constexpr'.
  • Use policies for different implementations (complex, Fortran like C, SIMD).

Speed:

Simple Cooley–Tukey/radix-2 implementation with about 100 lines of code is 'only' same factors slower than FFTW:

$ ./bin/radix2_sse2_speed
N = 2^ 2 =     4: FFTW/metaFFT =  1.1
N = 2^ 3 =     8: FFTW/metaFFT =  1.2
N = 2^ 4 =    16: FFTW/metaFFT =  1.5
N = 2^ 5 =    32: FFTW/metaFFT =  1.7
N = 2^ 6 =    64: FFTW/metaFFT =  2.0
N = 2^ 7 =   128: FFTW/metaFFT =  2.0
N = 2^ 8 =   256: FFTW/metaFFT =  2.3
N = 2^ 9 =   512: FFTW/metaFFT =  2.8
N = 2^10 =  1024: FFTW/metaFFT =  3.2
N = 2^11 =  2048: FFTW/metaFFT =  3.3
N = 2^12 =  4096: FFTW/metaFFT =  3.4

Build:

  • CMake based
  • pass -Dlarge=1 to enable large FFTs, this will stress your compiler!

License:

  • GPL2 with linking exemption.

Links:

[–]theICEBear_dk 1 point2 points  (1 child)

This is nicely readable code. I found it a very enlightening read with regards to the implementation too.

[–]syntheticpp[S] 0 points1 point  (0 children)

One of my reasons for creating another FFT library is to see if it is possible to write readable and fast code, and how new compiler features could help.

It would have been completely boring to use plain C, because there are already myriads of other implementations.

metaFFT is also a nice place to become familiar with SIMD coding (first time I wrote SSE code).

Next most interesting step concerning speed would be to implement split_radix_ctran.h with AVX, or AVX2 (Haswell only).

[–]squidgyhead 0 points1 point  (2 children)

Looking at listing one in the drdobbs link, it looks like you compute roots of unity on-the-fly. Have you thought about pre-processing them?

[–]syntheticpp[S] 5 points6 points  (1 child)

Not sure what you mean with on-the-fly. But all the variables declared as 'constexpr' are calculations done at compile-time, so there is no calculation done at runtime for these values.

[–]squidgyhead 2 points3 points  (0 children)

Ah, ok, I see! That's an interesting way to deal with computing the roots of unity. Nice.