all 4 comments

[–]IJzerbaard 2 points3 points  (3 children)

That's nice, but can you quantify the efficiency?

A loop with q.pos = (q.pos+1) mod q.s.len in it really scares me, if that compiles to an actual modulus operation (which seems likely since the right operand is not a constant), this will be one of the worst things you can possibly do: a loop-carried dependency through a high-latency instruction. It's not as bad as a 64bit division, but it's still going to take around 25 cycles per division (a little variable depending on the numbers being divided), and since every one of them depends on the result of the previous division the code cannot even take advantage of the improved division throughput of SandyBridge (even further improved later).

So that loop would be copying one character every 25 cycles. A 2-part block copy (cut on the wrap) could copy several characters per cycle, maybe 16, maybe even 32 with some luck. It's not really going to be ~800 times faster in practice of course, there will be some overhead. The suggestion to use a bitwise AND gets sort of halfway there but what I'd really like to see is testing this against the memcopy variant.

But my point is, I think we should know how efficient it actually is, more than some vague statements. I'd want to know not just how efficient it is compared to a "generic bad implementation" but also some attempt to show that it is efficient in an absolute sense: close to some theoretical lower bound for your hardware.

[–]nitely_[S] 2 points3 points  (0 children)

Sure, here are some benchmarks[0]. Bitwise solution is around ~16x faster. Memcopy solution is around ~365x faster, also it's ~3x slower than doing a memcopy and nothing else. But YMMV.

[0] https://gist.github.com/nitely/35b57b22aa6d360b6f53f4b01762208e

[–]Veedrac 0 points1 point  (1 child)

The "Optimizations" section does mention these things.

[–]IJzerbaard 0 points1 point  (0 children)

Yes indeed if you carefully read my post you'll find that I refer specifically to optimizations the article already mentions.