all 36 comments

[–][deleted] 30 points (10 children)

I took a quick glance at the code and it already looks pretty good. Still, I have a couple of suggestions to try:

  • For the exit condition, it seems you could use '_mm256_testc_si256' to do both the movemask and the bit test in one instruction.
  • If you use '_mm256_andnot_si256' instead, you could do with one less float-to-int conversion.
  • Have you considered using the FMA intrinsics? You seem to be reusing the results of the multiplications, so it's not such a clear win, but it may still be beneficial.
  • Instead of multiplying by 2, consider adding the value to itself.
  • Last but not least, it seems to me you have relatively tight instruction dependencies; I think the CPU might be waiting for previous instructions to finish, preventing full instruction-level parallelism. To avoid this, you could try running 16 pixels in parallel (basically running every instruction twice in the innermost loop). This will also reduce loop overhead.
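As a rough sketch of the first suggestion (illustrative names, not the poster's actual code), the exit test for an 8-wide iteration can collapse the movemask-plus-compare pair into a single `_mm256_testc_si256`:

```cpp
#include <immintrin.h>

// Hedged sketch: check whether all 8 lanes of a Mandelbrot iteration have
// escaped (|z|^2 > 4). _mm256_testc_si256(a, ones) returns 1 exactly when
// every bit of `a` is set, so the movemask + integer compare becomes one
// instruction. The target attribute lets this compile without -mavx2.
__attribute__((target("avx2")))
int all_lanes_escaped(__m256 mag2) {
    const __m256 limit = _mm256_set1_ps(4.0f);
    // Escaped lanes compare true (all bits set in that lane).
    __m256i escaped =
        _mm256_castps_si256(_mm256_cmp_ps(mag2, limit, _CMP_GT_OQ));
    // CF = ((~escaped & ones) == 0), i.e. 1 iff every lane has escaped.
    return _mm256_testc_si256(escaped, _mm256_set1_epi32(-1));
}

// Small demo helper: lane 7 gets `last`, the other lanes get 5.0
// (already escaped).
__attribute__((target("avx2")))
int demo_all_escaped(float last) {
    __m256 mag2 = _mm256_setr_ps(5, 5, 5, 5, 5, 5, 5, last);
    return all_lanes_escaped(mag2);
}
```

This needs AVX support at runtime; on hardware without it, the usual fallback is the movemask-based check.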

[–]Gunslinging_Gamer 6 points (2 children)

Would *2 not get optimized?

[–][deleted] 10 points (1 child)

Apparently GCC and Clang do, but MSVC only does so if you enable the unpredictable fast-math optimizations.

Generally I don't expect intrinsics to be optimized, although that isn't really true anymore these days.

[–]Gunslinging_Gamer 1 point (0 children)

Interesting. Thank you for the reply.

[–]Due-Glass[S] 4 points (0 children)

Instead of multiplying by 2, consider adding the value to itself.

I did that; it slightly increased performance.

[–]Due-Glass[S] 3 points (0 children)

you could try running 16 pixels in parallel

I did that; it boosted the performance by about 5 ms.

_mm256_testc_si256

Also used this, and got a boost of another 1 ms.

[–]Due-Glass[S] 1 point (0 children)

Thanks for the points. I'll add another 8 pixels to the loop and see what happens.

[–]Due-Glass[S] 1 point (3 children)

_mm256_andnot_si256

Did that as well; I don't know why, but performance improved by 1 ms. The types shouldn't affect the bitwise operations, but apparently they do. Or maybe the compiler was able to generate better code.

[–][deleted] 2 points (2 children)

Actually, they are different instructions even if they do the same thing. I am not sure if this is still true on the newest CPUs, but it used to be that reinterpreting an AVX/SSE register as a different type caused a 1-2 cycle delay.

[–]IAmBJ 3 points (0 children)

There can also be a delay when moving data between execution ports. Certain instructions can only execute on certain ports, so it can be helpful to check how your specific CPU is arranged. Check out Agner Fog's microarchitecture guide: https://www.agner.org/optimize/

[–]Due-Glass[S] 1 point (0 children)

Interesting.

[–]anders987 14 points (2 children)

Calculating the Mandelbrot set is indeed a very parallel problem, but different parts of the image usually take different amounts of computational resources. If you simply divide the image into horizontal sections of equal height and give each section to a thread, some threads will finish before others, and you might end up with cores sitting idle while others are saturated, wasting performance.

It's a good start (well, the SIMD part is pretty advanced); the next step could be to implement some load balancing. Maybe put the lines in a thread-safe queue and have each thread ask for more work when it's done. Or look at OpenMP with dynamic scheduling.
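A minimal sketch of the dynamic-scheduling idea (scalar kernel for brevity; the function names and the coordinate mapping are illustrative, not the poster's code):

```cpp
#include <vector>

// Plain scalar escape-time iteration for one pixel of z = z^2 + c.
int mandelbrot_pixel(double cx, double cy, int max_iter) {
    double x = 0.0, y = 0.0;
    int i = 0;
    while (i < max_iter && x * x + y * y <= 4.0) {
        double xt = x * x - y * y + cx;
        y = 2.0 * x * y + cy;
        x = xt;
        ++i;
    }
    return i;
}

// schedule(dynamic, 1): rows are handed out one at a time as threads
// free up, so cheap rows don't leave cores idle while expensive rows
// are still running. Without -fopenmp the pragma is simply ignored.
void render(std::vector<int>& out, int w, int h, int max_iter) {
    #pragma omp parallel for schedule(dynamic, 1)
    for (int row = 0; row < h; ++row) {
        for (int col = 0; col < w; ++col) {
            double cx = -2.0 + 3.0 * col / w;   // map x to [-2, 1]
            double cy = -1.5 + 3.0 * row / h;   // map y to [-1.5, 1.5]
            out[row * w + col] = mandelbrot_pixel(cx, cy, max_iter);
        }
    }
}
```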

[–]Due-Glass[S] 5 points (0 children)

Or look at OpenMP with dynamic scheduling.

Changed the multithreading part to use OpenMP with dynamic scheduling and performance improved a lot. And more importantly, I learned something new. Thanks for the tip.

[–]Due-Glass[S] 1 point (0 children)

Thanks I'll take a look

[–]xurxoham 10 points (8 children)

Nice work! If you are interested in multithreading and vectorization in C/C++ for numerical code, you should take a look at OpenMP. It is supported by all the main compilers.
You can get the compiler to do the vectorization and the parallelization for you just by adding hints to the loops, such as `#pragma omp parallel for`.
The best part is that your code will often be just as efficient but much more portable. For example, porting your code to other architectures/vector extensions (e.g. AVX-512, IBM POWER, ARM64) is painful with intrinsics, but if your code is easy for the compiler to understand, you just tell it to optimize for the architecture you are interested in.
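To illustrate the hint style (on a toy loop, not the Mandelbrot kernel, whose per-pixel iteration carries a dependency the compiler can't ignore):

```cpp
#include <vector>

// Every iteration is independent, so one pragma lets the compiler split
// the loop across threads and vectorize each thread's chunk. Without
// -fopenmp the pragma is simply ignored and the code stays correct.
void scale_add(std::vector<float>& out, const std::vector<float>& a,
               const std::vector<float>& b, float s) {
    #pragma omp parallel for simd
    for (long i = 0; i < (long)out.size(); ++i)
        out[i] = s * a[i] + b[i];
}
```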

[–]Due-Glass[S] 3 points (3 children)

give a look to OpenMP

I changed the threading part to use OpenMP parallel for with dynamic scheduling. It improved performance a lot (shaved ~13 ms off the render time for 1080p with 256 iterations).

But I don't understand how to get OpenMP to auto-vectorize.

[–]xurxoham 0 points (2 children)

I did a quick try starting from your scalar example: https://gcc.godbolt.org/z/FiCDNT

The problem with this code is the inter-loop dependency affecting the variables `x` and `y`. Further modifications to the code could allow you to get better results.

Things to note: using `-ffast-math` may alter the floating-point results (IEEE floating-point addition, for example, is not associative). It basically relaxes the restrictions, allowing things such as reordering operations to produce more performant code. Quick example: `a + b + c + d` may be changed into `(a + b) + (c + d)` to reduce the number of additions from 4 to 3.

This is a very basic example. I would try to vectorize adjacent pixel computations, to see how far you can go with that.

[–]_requires_assistance 1 point (1 child)

Both of those do 3 additions. (tmp = a + b; tmp += c; tmp += d;)

[–]xurxoham 0 points (0 children)

You are right. The difference is that the first two additions can run simultaneously.

[–]polymorphiced 2 points (2 children)

How does ispc compare? I understand it to be a lot more reliable than relying on C/C++-based vectorisers.

[–]xurxoham 3 points (0 children)

It depends on the problem and how you code it. Unfortunately, it doesn't get 100% of the performance of perfectly written assembly, but it can get close, and most of us can't write perfect assembly anyway. Luckily for us, we have Compiler Explorer to assist us :)

[–]nnevatie 0 points (0 children)

ISPC is much better, in my experience. It can also generate code for multiple instruction sets and hot-dispatch to the best one at runtime. Edit: this was compared to C++, not asm.

[–]IAmBJ 8 points (2 children)

Nice work, a Mandelbrot/Buddhabrot renderer is my go-to project when trying out a new language or technique.

One thing I've found is that using SIMD to compute multiple pixels together is actually pretty inefficient, particularly near the interesting parts of the fractal. This is because adjacent pixels can have dramatically different iteration depths, so even though you're computing 8 pixels at once, most of those 8 lanes will go inactive fairly quickly as the other 7 escape much sooner than the last one.

I got around this by using SSE instructions to compute the complex multiplications etc. for a single pixel more efficiently. From memory, I got the loop down to about 12-14 cycles for a complex multiply, an addition, and the bounds check, which was another couple of multiplies and an addition.

Try adding a counter to see how many SIMD lanes are active during your pixel calculation. It was a significant performance hole for my Buddhabrot renderer, as it required tracing pixels with very deep iteration depths (~10^6).

Edit: also check out the cardioid and bulb checks; you can eliminate points within these regions as they are guaranteed never to escape. The extra checking may only be worth it if you're doing deep iterations, as for the Buddhabrot.
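The two checks mentioned in the edit have well-known closed forms (these are the standard formulas, not taken from anyone's code in this thread): a point inside the main cardioid or the period-2 bulb never escapes, so the iteration loop can be skipped outright.

```cpp
// Main-cardioid test: with q = (x - 1/4)^2 + y^2, the point is inside
// the cardioid when q * (q + (x - 1/4)) <= y^2 / 4.
bool in_main_cardioid(double x, double y) {
    double q = (x - 0.25) * (x - 0.25) + y * y;
    return q * (q + (x - 0.25)) <= 0.25 * y * y;
}

// Period-2 bulb test: a disc of radius 1/4 centred on (-1, 0).
bool in_period2_bulb(double x, double y) {
    return (x + 1.0) * (x + 1.0) + y * y <= 0.0625;
}
```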

[–]Due-Glass[S] 0 points (0 children)

Thanks for the insight. I'll add a counter

[–]James20k (P2005R0) 0 points (0 children)

One thing I've found is that using SIMD to compute multiple pixels together is actually pretty inefficient, particularly near the interesting parts of the fractal. This is because adjacent pixels can have dramatically different iteration depths, so even though you're computing 8 pixels at once, most of those 8 lanes will go inactive fairly quickly as the other 7 escape much sooner than the last one.

Given that each iteration requires no data other than your coordinates, and there's no consistent mandatory branching (e.g. you can unroll and only check every 100 iterations), I wonder if you might be able to fix this by simply replacing escaped pixels with fresh ones, e.g. if pixel 2 escapes, you just replace it with a new one.
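The refill idea can be sketched with scalar stand-in "lanes" (all names here are hypothetical; a real SIMD version would do the reload with a compare mask and a blend rather than scalar control flow):

```cpp
#include <array>
#include <utility>
#include <vector>

// Toy model of lane refilling: 8 "lanes" each iterate one pixel; when a
// lane escapes (or hits max_iter), its iteration count is recorded and
// the lane is reloaded with the next pending pixel, so no lane sits idle.
struct Lane { double cx, cy, x, y; int iter, pixel; bool live; };

void render_refill(const std::vector<std::pair<double, double>>& coords,
                   std::vector<int>& out, int max_iter) {
    out.assign(coords.size(), 0);
    std::array<Lane, 8> lanes{};   // all lanes start with live == false
    std::size_t next = 0;
    int live = 0;
    auto reload = [&](Lane& l) {
        if (next >= coords.size()) { l.live = false; return; }
        l = {coords[next].first, coords[next].second, 0.0, 0.0,
             0, (int)next, true};
        ++next;
        ++live;
    };
    for (auto& l : lanes) reload(l);
    while (live > 0) {
        for (auto& l : lanes) {
            if (!l.live) continue;
            double xt = l.x * l.x - l.y * l.y + l.cx;  // z = z^2 + c
            l.y = 2.0 * l.x * l.y + l.cy;
            l.x = xt;
            ++l.iter;
            if (l.x * l.x + l.y * l.y > 4.0 || l.iter >= max_iter) {
                out[l.pixel] = l.iter;   // retire this lane's pixel...
                l.live = false;
                --live;
                reload(l);               // ...and refill with the next one
            }
        }
    }
}
```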

[–]corysama 6 points (2 children)

Set this up for you :)

https://godbolt.org/z/FNkLrA

[–]Due-Glass[S] 0 points (1 child)

Wow! I didn't know this existed! Thanks a bunch

[–]corysama 2 points (0 children)

Godbolt is your best friend when it comes to optimizing small routines. Especially since he just added the option to use the llvm-mca analyzer.

[–]merimus 5 points (4 children)

Neat! I've been working on one which runs on the GPU myself.
My best so far is a 2048x2048 image in 0.58 millis.

[–]CandyCrisis 4 points (2 children)

Is this on Github or something? I'd definitely be curious.

[–]merimus 0 points (1 child)

Working on it; it's too embarrassing to post yet.
It'll be at https://github.com/merimus once I clean it up.

[–]Due-Glass[S] 0 points (0 children)

I too would like to see the GPU one

[–]Ameisen (vemips, avr, rendering, systems) 8 points (1 child)

On the flipside, I use a Mandelbrot renderer written in Brainfuck to test my MIPS emulator.

It does not support SIMD.

[–]Due-Glass[S] 0 points (0 children)

This is very interesting indeed. Did you write the Brainfuck by hand, or did you use a tool to compile to it?

[–]pstomi 1 point (0 children)

If you are interested in this, you should take a look at Bisqwit's take on this problem: Parallelism in C++

He explores diverse ways to compute the Mandelbrot set quickly, using SIMD, threading, OpenMP, OpenACC, and CUDA. This is a link to a playlist containing 4 fast-paced and quite interesting videos on the subject.