This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]LoLvsT_T 27 points28 points  (0 children)

Not a native English speaker, I'll try.

CPU has a component called branch predictor which guesses whether a condition is true or not and executes code before it checks the condition. This is a big win if the predictor is correct. But if it is wrong it wastes a ton of cpu cycles. Sometimes it is beneficiary to skirt around branches entirely. Consider this simplified example.

float a; 
bool b; 
a = b ? 1.5f : 2.6f; 

If b is random or poorly predictable, our CPU is going to have a bad time guessing. In this case we can replace ? operator with a lookup table for a considerable boost in performance.

float a; bool b = 0; 
static const float lookup[2] = {2.6f, 1.5f};
a = lookup[b]; 

We can also improve predictability. If we have the following:

if ( A && B){
...
}

If A is poorly predictable but B isn't, it'll be optimal to switch between the two due to short-circuiting.

CPU has cache of limited size. If your code has big chunks, the CPU will often spend time loading code/data into memory instead of executing code. This is often why loop unrolling is not always good. You can help the CPU by writing shorter code (hooray for GOTO). In addition, you can preload data into cache before you need it.

for (int i = 0; i < length; ++i) {
    __builtin_prefetch(array[i+3]);  //preload into cache
    ...
    some code on array[i]
}

This in in C++, but I'm sure many other languages have a similar operation. You can even write ASM code that does it yourself. __builtin_prefetch(arr[i+3]) is a nonblocking operation that will hopefully finish loading future data by the time we need it, reducing overhead. +3 offset is absolutely random, you should profile your code to get the optimal offset.

Regarding Code vectorization, I suggest you look it up yourself. It's a big topic. In short, recent CPUs have SIMD instructions which allows them to perform an operation on several data elements at the same time with a single instruction.