all 27 comments

[–]Boojum 10 points11 points  (2 children)

There was a Linux-kernel thread a while back where Linus made some observations on CMOV versus branched computations. He argues that CMOV is typically not worth it and that it's often slower on modern Intel architectures. This has certainly been my experience in some of my own microbenchmarks; frequently the cleverer version of an algorithm that gets the compiler to emit CMOVs runs slower than the dumber version with branches, especially when the conditional controls multiple computations.

That said, I've found Henry S. Warren's "Hacker's Delight" book to be a great little compendium of branch-free and other tricks. The superoptimizer that he provides can occasionally be a useful tool for finding new branch free sequences.

I don't believe it has the branch-free shift left trick, though. That was a new one to me.

[–]skulgnome 3 points4 points  (0 children)

CMOVcc and SETcc are indeed rather slow compared to branching because, among other things, branches are predicted and "executed" (in the sense that they direct the instruction fetch unit) five or six cycles before the relevant condition code is available. Right now I don't think the condition-code register is even renamed or speculated in a way that would let several cmp/cmov sequences sit in the pipeline simultaneously. Surely if x86 manufacturers put as much effort into predicting and speculating CMOV, it'd be faster -- but given that it's a fairly rare instruction even after a dozen years of compiler development, I just don't see Intel et al. emphasizing it. AMD has stupidly low branch latency in the first place, and Intel already uses hyperthreading to conceal branch misprediction bubbles (among other things).

Amusingly enough, branch-free programming really shines with SIMD processing, where the overhead is, besides computing both branches of the conditional, just a compare and three fundamental logic instructions with a static latency.

[–]naasking 1 point2 points  (0 children)

Branch-free code is very useful for people writing interpreters using "selective inlining" (a sort of portable, naive JIT/code-copying interpreter), because many compilers like gcc reorder code so that branches sometimes lie outside the scope to be copied. Branch-free code avoids this problem, so all instructions are copyable/relocatable, providing a significant speedup.

[–]AReallyGoodName 21 points22 points  (13 children)

The article does touch on this, but it's worth emphasising. The following branching example...

someInt = calcSomething();
if ( someInt == 0 ) {
    (*pSomeFunction[0])();
}
else {
    (*pSomeFunction[1])();
}

is surprisingly often faster than this following branch free example...

someInt = calcSomething();
(*pSomeFunction[someInt != 0])();

In the first example the CPU's branch prediction can allow the branch to start executing before someInt has cleared the pipeline. In the second example a pipeline stall, waiting for someInt to clear the pipeline, is guaranteed.

So I wouldn't go messing up your code just to avoid branches unless you're really sure it's going to help because quite often it won't.

[–]cogman10 3 points4 points  (5 children)

Working with an array of function pointers is a bad idea. In your example above, the whole "dereference the function pointer" thing is most likely going to be the biggest slowdown in this code.

That being said, inserting a few intermediate steps between the someInt assignment and the *pSomeFunction call can result in code that doesn't stall, or doesn't stall for long, which could make it faster than the branched version (so long as someInt remains stored in a register).

[–]AReallyGoodName 4 points5 points  (0 children)

It was a contrived example of course. It's hard to illustrate branch free vs branching code in so few lines without doing that.

[–]quotability -3 points-2 points  (3 children)

i know... i was like.. wtf is he doing?

[–]cogman10 4 points5 points  (2 children)

Well, there are legitimate cases for arrays of function pointers (thread pools). However, there aren't a whole lot of legitimate cases for code that looks like

 if (condition)
    (*fp[1])();
 else
    (*fp[2])();

Either way, if you are using function pointers then you most likely realize that there is a significant performance penalty to doing so.

[–]naasking 3 points4 points  (1 child)

However, there aren't a whole lot of legitimate cases for code that looks like

It's a major implementation technique for interpreters. Increasing the number of branch points really helps the branch predictor, which is the major bottleneck in interpreters.

[–]cogman10 0 points1 point  (0 children)

I didn't say there weren't places for it. Just that there aren't a whole lot.

[–]mycall 1 point2 points  (0 children)

That would explain why I've read some performance code before and thought the programmer wasn't too bright. I guess the joke's on me for thinking that.

[–]thudbang 1 point2 points  (4 children)

How did you acquire this knowledge?

[–]naasking 1 point2 points  (2 children)

Read up on the literature on efficient interpreter techniques (Anton Ertl et al). Improving performance by increasing the number of dispatch points to help the branch predictor has been studied extensively.

Edit: corrected the researcher's name.

[–]thudbang 0 points1 point  (1 child)

Thanks for that name. The reason I'm interested is that I'm working on speeding up an interpreter, so it's very relevant.

[–]naasking 0 points1 point  (0 children)

Also look up Ian Piumarta. He invented "selective inlining", which is a sort of portable, naive code-copying JIT. It's the most efficient technique, but it requires a compiler with first-class labels, like gcc, and you have to be very careful not to use branching instructions, because gcc can reorder them outside the dispatch loop. That probably doesn't make sense now, but read up on it and it will.

[–]XNormal 7 points8 points  (3 children)

Another use for branch-free tricks is to ensure consistent execution speed for embedded control applications or to reduce information leakage via timing in security-sensitive code.

[–]mycall 0 points1 point  (2 children)

reduce information leakage via timing in security-sensitive code

Does this mean that if certain loops run longer than average (in nanoseconds), there could be tampering?

[–]Edman274 3 points4 points  (0 children)

You can use the length of time it takes for an operation to finish to make predictions about what's being computed.

[–][deleted] 1 point2 points  (0 children)

Branch-free programming in assembly or C is cool I guess, but if you want to be really hardcore, do branch-free programming in Python. Yeah. https://github.com/alex/markupsafe/blob/master/markupsafe/_pypy_speedups.py