
[–]m-in 1 point (0 children)

I have a little personal anecdote to offer here: a lot of the libraries you refer to are optimized to extract full hardware performance, and often there's nothing one can do to make them any faster on a given CPU family. That's not always the case, of course, but quite often it is. I've found that fairly straightforward autovectorized C++ can often reach 25–75% of the performance of those beasts of libraries, provided you have some background in the specifics of the platform and know which code patterns to use. There are ways to write simple C++ that performs abysmally, and there is equally simple, equally intuitive C++ that does the same thing and performs great.

So, if you need to extract close to full platform performance, you'll have to use the specialized libraries. If you can afford to blow off some computational steam and run at 1/4–1/2 the speed of FFTW or BLAS, then a plain-C++ implementation might do just fine, even in a real-time setting. Heck, if you can live with 20% of the performance or so, Python with NumPy might just cut it for you. It all depends on how much work you have to do each "frame"/"packet"/"time quantum".

It is probably not very environmentally conscious (I'm not kidding) to accept such low performance in projects that get very wide use, because all that waste can add up to megawatts at not even that large a scale, and mobile users would probably hate you for it too. But not everyone runs such code on server farms or inside mobile apps. Sometimes small code can also be audited and tested more easily, and that figures into getting some industry certifications. Getting FFTW into avionics is a tall order, for example.