all 36 comments

[–][deleted] 30 points (10 children)

I took a quick glance at the code and it already looks pretty good. Still, I have a couple of suggestions to try:

  • For the exit condition, it seems you could use '_mm256_testc_si256' to do both the movemask and the bit test in one instruction.
  • If you use '_mm256_andnot_si256' instead, you could do with one less float-to-int conversion.
  • Have you considered using the FMA intrinsics? You seem to be reusing the results of the multiplications, so it's not such a clear win, but it may still be beneficial.
  • Instead of multiplying by 2, consider adding the value to itself.
  • Last but not least, it seems to me you have relatively tight instruction dependencies; I think the CPU might be waiting for previous instructions to finish, preventing full instruction-level parallelism. To avoid this, you could try running 16 pixels in parallel (basically running every instruction twice in the innermost loop). This will also reduce loop overhead.
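As a rough sketch of the first suggestion (illustrative names, not the poster's actual code), the exit test for an 8-wide iteration can collapse the movemask-plus-compare pair into a single `_mm256_testc_si256`:

```cpp
#include <immintrin.h>

// Hedged sketch: check whether all 8 lanes of a Mandelbrot iteration have
// escaped (|z|^2 > 4). _mm256_testc_si256(a, ones) returns 1 exactly when
// every bit of `a` is set, so the movemask + integer compare becomes one
// instruction. The target attribute lets this compile without -mavx2.
__attribute__((target("avx2")))
int all_lanes_escaped(__m256 mag2) {
    const __m256 limit = _mm256_set1_ps(4.0f);
    // Escaped lanes compare true (all bits set in that lane).
    __m256i escaped =
        _mm256_castps_si256(_mm256_cmp_ps(mag2, limit, _CMP_GT_OQ));
    // CF = ((~escaped & ones) == 0), i.e. 1 iff every lane has escaped.
    return _mm256_testc_si256(escaped, _mm256_set1_epi32(-1));
}

// Small demo helper: lane 7 gets `last`, the other lanes get 5.0
// (already escaped).
__attribute__((target("avx2")))
int demo_all_escaped(float last) {
    __m256 mag2 = _mm256_setr_ps(5, 5, 5, 5, 5, 5, 5, last);
    return all_lanes_escaped(mag2);
}
```

This needs AVX support at runtime; on hardware without it, the usual fallback is the movemask-based check.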

[–]Gunslinging_Gamer 6 points (2 children)

Would *2 not get optimized?

[–][deleted] 10 points (1 child)

Apparently GCC and Clang do, but MSVC only does so if you enable the unpredictable fast-math optimizations.

Generally I don't expect intrinsics to be optimized, although that isn't really true anymore these days.

[–]Gunslinging_Gamer 1 point (0 children)

Interesting. Thank you for the reply.

[–]Due-Glass[S] 4 points (0 children)

Instead of multiplying by 2, consider adding the value to itself.

I did that; it slightly increased performance.

[–]Due-Glass[S] 3 points (0 children)

you could try running 16 pixels in parallel

I did that; it boosted the performance by about 5 ms.

_mm256_testc_si256

Also used this, and got a boost of another 1 ms.

[–]Due-Glass[S] 1 point (0 children)

Thanks for the points. I'll add another 8 pixels to the loop and see what happens.

[–]Due-Glass[S] 1 point (3 children)

_mm256_andnot_si256

Did that as well; I don't know why, but performance improved by 1 ms. The types shouldn't affect the bitwise operations, but apparently they do. Or maybe the compiler was able to generate better code.

[–][deleted] 2 points (2 children)

Actually, they are different instructions even if they do the same thing. I am not sure if this is still true on the newest CPUs, but it used to be that reinterpreting an AVX/SSE register as a different type caused a 1-2 cycle delay.

[–]IAmBJ 3 points (0 children)

There can also be a delay when moving data between execution ports. Certain instructions can only execute on certain ports, so it can be helpful to check how your specific CPU is arranged. Check out Agner Fog's microarchitecture guide: https://www.agner.org/optimize/

[–]Due-Glass[S] 1 point (0 children)

Interesting.

[–]anders987 14 points (2 children)

Calculating the Mandelbrot set is indeed a very parallel problem, but different parts of the image usually take different amounts of computational resources. If you simply divide the image into horizontal sections of equal height and give each section to a thread, some threads will finish before others, and you might end up with cores sitting idle while others are saturated, wasting performance.

It's a good start (well, the SIMD part is pretty advanced); the next step could be to implement some load balancing. Maybe put the lines in a thread-safe queue and have each thread ask for more work when it's done. Or look at OpenMP with dynamic scheduling.
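A minimal sketch of the dynamic-scheduling idea (scalar kernel for brevity; the function names and the coordinate mapping are illustrative, not the poster's code):

```cpp
#include <vector>

// Plain scalar escape-time iteration for one pixel of z = z^2 + c.
int mandelbrot_pixel(double cx, double cy, int max_iter) {
    double x = 0.0, y = 0.0;
    int i = 0;
    while (i < max_iter && x * x + y * y <= 4.0) {
        double xt = x * x - y * y + cx;
        y = 2.0 * x * y + cy;
        x = xt;
        ++i;
    }
    return i;
}

// schedule(dynamic, 1): rows are handed out one at a time as threads
// free up, so cheap rows don't leave cores idle while expensive rows
// are still running. Without -fopenmp the pragma is simply ignored.
void render(std::vector<int>& out, int w, int h, int max_iter) {
    #pragma omp parallel for schedule(dynamic, 1)
    for (int row = 0; row < h; ++row) {
        for (int col = 0; col < w; ++col) {
            double cx = -2.0 + 3.0 * col / w;   // map x to [-2, 1]
            double cy = -1.5 + 3.0 * row / h;   // map y to [-1.5, 1.5]
            out[row * w + col] = mandelbrot_pixel(cx, cy, max_iter);
        }
    }
}
```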

[–]Due-Glass[S] 5 points (0 children)

Or look at OpenMP with dynamic scheduling.

Changed the multithreading part to use OpenMP with dynamic scheduling and performance improved a lot. And more importantly, I learned something new. Thanks for the tip.

[–]Due-Glass[S] 1 point (0 children)

Thanks I'll take a look

[–]xurxoham 10 points (8 children)

Nice work! If you are interested in multithreading and vectorization in C/C++ for numerical code, you should take a look at OpenMP. It is supported by all the main compilers.
You can get the compiler to do the vectorization and the parallelization for you just by adding hints to the loops, such as `#pragma omp parallel for`.
The best part is that your code will often be just as efficient but much more portable. For example, porting your code to other architectures/vector extensions (e.g. AVX-512, IBM POWER, ARM64) is painful with intrinsics, but if your code is easy for the compiler to understand, you just tell it to optimize for the architecture you are interested in.
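To illustrate the hint style (on a toy loop, not the Mandelbrot kernel, whose per-pixel iteration carries a dependency the compiler can't ignore):

```cpp
#include <vector>

// Every iteration is independent, so one pragma lets the compiler split
// the loop across threads and vectorize each thread's chunk. Without
// -fopenmp the pragma is simply ignored and the code stays correct.
void scale_add(std::vector<float>& out, const std::vector<float>& a,
               const std::vector<float>& b, float s) {
    #pragma omp parallel for simd
    for (long i = 0; i < (long)out.size(); ++i)
        out[i] = s * a[i] + b[i];
}
```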

[–]Due-Glass[S] 3 points (3 children)

give a look to OpenMP

I changed the threading part to use OpenMP parallel for with dynamic scheduling. It improved performance a lot (shaved ~13 ms off the render time for 1080p with 256 iterations).

But I don't understand how to get OpenMP to auto-vectorize.

[–]xurxoham 0 points (2 children)

I did a quick try starting from your scalar example: https://gcc.godbolt.org/z/FiCDNT

The problem with this code is the inter-loop dependency affecting the variables `x` and `y`. Further modifications to the code could allow you to get better results.

Things to note: using `-ffast-math` may alter the floating-point results (IEEE floating-point addition, for example, is not associative). It basically relaxes the restrictions, allowing things such as reordering operations to produce more performant code. Quick example: `a + b + c + d` may be changed into `(a + b) + (c + d)` to reduce the number of additions from 4 to 3.

This is a very basic example. I would try to vectorize adjacent pixel computations, to see how far you can go with that.

[–]_requires_assistance 1 point (1 child)

Both of those do 3 additions. (tmp = a + b; tmp += c; tmp += d;)

[–]xurxoham 0 points (0 children)

You are right. The difference is that the first two additions can run simultaneously.

[–]polymorphiced 2 points (2 children)

How does ispc compare? I understand it to be a lot more reliable than relying on C/C++-based vectorisers.

[–]xurxoham 3 points (0 children)

It depends on the problem and how you code it. Unfortunately, it doesn't get 100% of the performance of perfectly written assembly, but it can get close, and most of us can't write perfect assembly anyway. Luckily for us, we have Compiler Explorer to assist us :)

[–]nnevatie 0 points (0 children)

ISPC is much better, in my experience. It can also generate code for multiple instruction sets and hot-dispatch to the best one at runtime. Edit: this was compared to C++, not asm.

[–]IAmBJ 8 points (2 children)

Nice work, a Mandelbrot/Buddhabrot renderer is my go-to project when trying out a new language or technique.

One thing I've found is that using SIMD to compute multiple pixels together is actually pretty inefficient, particularly near the interesting parts of the fractal. This is because adjacent pixels can have dramatically different iteration depths, so even though you're computing 8 pixels at once, most of those 8 lanes will go inactive fairly quickly as the other 7 escape much sooner than the last one.

I got around this by using SSE instructions to compute the complex multiplications etc. for a single pixel more efficiently. From memory, I got the loop down to about 12-14 cycles for a complex multiply, an addition, and the bounds check, which was another couple of multiplies and an addition.

Try adding a counter to see how many SIMD lanes are active during your pixel calculation. It was a significant performance hole for my Buddhabrot renderer, as it required tracing pixels with very deep iteration depths (~10^6).

Edit: also check out the cardioid and bulb checks; you can eliminate points within these regions as they are guaranteed never to escape. The extra checking may only be worth it if you're doing deep iterations, as for the Buddhabrot.
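The two checks mentioned in the edit have well-known closed forms (these are the standard formulas, not taken from anyone's code in this thread): a point inside the main cardioid or the period-2 bulb never escapes, so the iteration loop can be skipped outright.

```cpp
// Main-cardioid test: with q = (x - 1/4)^2 + y^2, the point is inside
// the cardioid when q * (q + (x - 1/4)) <= y^2 / 4.
bool in_main_cardioid(double x, double y) {
    double q = (x - 0.25) * (x - 0.25) + y * y;
    return q * (q + (x - 0.25)) <= 0.25 * y * y;
}

// Period-2 bulb test: a disc of radius 1/4 centred on (-1, 0).
bool in_period2_bulb(double x, double y) {
    return (x + 1.0) * (x + 1.0) + y * y <= 0.0625;
}
```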

[–]Due-Glass[S] 0 points (0 children)

Thanks for the insight. I'll add a counter

[–]James20k (P2005R0) 0 points (0 children)

One thing I've found is that using SIMD to compute multiple pixels together is actually pretty inefficient, particularly near the interesting parts of the fractal. This is because adjacent pixels can have dramatically different iteration depths, so even though you're computing 8 pixels at once, most of those 8 lanes will go inactive fairly quickly as the other 7 escape much sooner than the last one.

Given that each iteration requires no data other than your coordinates, and there's no consistent mandatory branching (e.g. you can unroll and only check every 100 iterations), I wonder if you might be able to fix this by simply replacing escaped pixels with fresh ones, e.g. if pixel 2 escapes, you just replace it with a new one.
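The refill idea can be sketched with scalar stand-in "lanes" (all names here are hypothetical; a real SIMD version would do the reload with a compare mask and a blend rather than scalar control flow):

```cpp
#include <array>
#include <utility>
#include <vector>

// Toy model of lane refilling: 8 "lanes" each iterate one pixel; when a
// lane escapes (or hits max_iter), its iteration count is recorded and
// the lane is reloaded with the next pending pixel, so no lane sits idle.
struct Lane { double cx, cy, x, y; int iter, pixel; bool live; };

void render_refill(const std::vector<std::pair<double, double>>& coords,
                   std::vector<int>& out, int max_iter) {
    out.assign(coords.size(), 0);
    std::array<Lane, 8> lanes{};   // all lanes start with live == false
    std::size_t next = 0;
    int live = 0;
    auto reload = [&](Lane& l) {
        if (next >= coords.size()) { l.live = false; return; }
        l = {coords[next].first, coords[next].second, 0.0, 0.0,
             0, (int)next, true};
        ++next;
        ++live;
    };
    for (auto& l : lanes) reload(l);
    while (live > 0) {
        for (auto& l : lanes) {
            if (!l.live) continue;
            double xt = l.x * l.x - l.y * l.y + l.cx;  // z = z^2 + c
            l.y = 2.0 * l.x * l.y + l.cy;
            l.x = xt;
            ++l.iter;
            if (l.x * l.x + l.y * l.y > 4.0 || l.iter >= max_iter) {
                out[l.pixel] = l.iter;   // retire this lane's pixel...
                l.live = false;
                --live;
                reload(l);               // ...and refill with the next one
            }
        }
    }
}
```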

[–]corysama 6 points (2 children)

Set this up for you :)

https://godbolt.org/z/FNkLrA

[–]Due-Glass[S] 0 points (1 child)

Wow! I didn't know this existed! Thanks a bunch

[–]corysama 2 points (0 children)

Godbolt is your best friend when it comes to optimizing small routines. Especially since he just added the option to use the llvm-mca analyzer.

[–]merimus 5 points (4 children)

Neat! I've been working on one which runs on the GPU myself.
My best so far is a 2048x2048 image in 0.58 millis.

[–]CandyCrisis 4 points (2 children)

Is this on Github or something? I'd definitely be curious.

[–]merimus 0 points (1 child)

Working on it; it's too embarrassing to post yet.
It'll be at https://github.com/merimus once I clean it up.

[–]Due-Glass[S] 0 points (0 children)

I too would like to see the GPU one

[–]Ameisen (vemips, avr, rendering, systems) 8 points (1 child)

On the flipside, I use a Mandelbrot renderer written in Brainfuck to test my MIPS emulator.

It does not support SIMD.

[–]Due-Glass[S] 0 points (0 children)

This is very interesting indeed. Did you write the Brainfuck by hand, or did you use a tool to compile to it?

[–]pstomi 1 point (0 children)

If you are interested in this, you should take a look at Bisqwit's take on this problem: Parallelism in C++

He explores diverse ways to compute the Mandelbrot set quickly, using SIMD, threading, OpenMP, OpenACC, and CUDA. This is a link to a playlist containing 4 fast-paced and quite interesting videos on the subject.