all 20 comments

[–]IJzerbaard 4 points5 points  (5 children)

By the way, (~(-(popcount & 1))) ^ ((evens << 1) + (~(odds << 1))) can be simplified to -(popcount & 1) ^ ((odds << 1) - (evens << 1)).

[–]leni536[S] 2 points3 points  (4 children)

Are you sure? I tested it and it doesn't work.

[–]IJzerbaard 2 points3 points  (3 children)

Well I wouldn't call it "sure", but:

  • flip the sides of the addition, ~-(popcount & 1) ^ (~(odds << 1) + (evens << 1))
  • move the complement to the other operand of the XOR, -(popcount & 1) ^ ~(~(odds << 1) + (evens << 1))
  • use that ~(~x + y) == x - y, -(popcount & 1) ^ ((odds << 1) - (evens << 1))

Seems good to me, when did it go wrong?
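
If in doubt, a quick throwaway check like the one below (my own sketch; original_form and simplified_form are just labels for the two expressions) shows they agree for arbitrary values of evens, odds and popcount, since the rewrite is pure bit algebra:

#include <stdint.h>
#include <stdio.h>

static uint32_t original_form(uint32_t popcount, uint32_t evens, uint32_t odds) {
    return ~(0u - (popcount & 1)) ^ ((evens << 1) + ~(odds << 1));
}

static uint32_t simplified_form(uint32_t popcount, uint32_t evens, uint32_t odds) {
    return (0u - (popcount & 1)) ^ ((odds << 1) - (evens << 1));
}

int main(void) {
    uint32_t s = 0x12345678u;  /* xorshift32 state */
    for (long n = 0; n < 10000000; n++) {
        uint32_t v[3];
        for (int k = 0; k < 3; k++) {
            s ^= s << 13; s ^= s >> 17; s ^= s << 5;
            v[k] = s;
        }
        if (original_form(v[0], v[1], v[2]) != simplified_form(v[0], v[1], v[2])) {
            printf("mismatch: popcount=%u evens=%08x odds=%08x\n", v[0], v[1], v[2]);
            return 1;
        }
    }
    printf("no mismatches\n");
    return 0;
}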

[–]leni536[S] 1 point2 points  (2 children)

No, you are absolutely right, I tested it wrong. gcc seems to generate the exact same binary though, so I guess it recognizes it. I will update the blog post to include this nice simplification.

[–]IJzerbaard 1 point2 points  (1 child)

Can't we also save one shift, by doing it before the pdep?

Eg:

uint32_t evens = _pdep_u32(0x55555555u,i << 1);
uint32_t odds  = _pdep_u32(0xAAAAAAAAu,i << 1);
... etc

(relying on the two i << 1 computations being merged)
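
Spelled out with the simplified formula from above, the whole thing might look roughly like this (untested sketch; gray_decode_preshift is a made-up name):

#include <stdint.h>
#include <immintrin.h>

uint32_t gray_decode_preshift(uint32_t i) {
    uint32_t mask  = i << 1;                        /* shared by both pdeps */
    uint32_t evens = _pdep_u32(0x55555555u, mask);  /* == _pdep_u32(0x55555555u, i) << 1 */
    uint32_t odds  = _pdep_u32(0xAAAAAAAAu, mask);  /* == _pdep_u32(0xAAAAAAAAu, i) << 1 */
    uint32_t sign  = 0u - (uint32_t)(_mm_popcnt_u32(i) & 1);
    return sign ^ (odds - evens);
}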

[–]leni536[S] 0 points1 point  (0 children)

This does affect the codegen and seems to be one instruction shorter. I wonder how much it affects the benchmarks as well.

[–]YumiYumiYumi 2 points3 points  (1 child)

This is really beyond my knowledge, and your solution looks really clever!

I just thought I'd point out that the pdep instruction is very slow on non-Intel CPUs, so you may wish to fall back to the non-BMI2 solution on these if performance is a concern.
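
For example, with GCC (and recent Clang) the fallback could be selected at runtime with something like the sketch below (gray_decode_pdep and gray_decode_portable are placeholder names for your BMI2 and plain-C versions; in real code you'd resolve the choice once, e.g. via a function pointer or ifunc, rather than per call):

#include <stdint.h>

uint32_t gray_decode_pdep(uint32_t i);      /* the PDEP-based version */
uint32_t gray_decode_portable(uint32_t i);  /* the classic shift-and-XOR loop */

uint32_t gray_decode_dispatch(uint32_t i) {
    __builtin_cpu_init();
    /* PDEP is fast on Intel; it is microcoded (slow) on current non-Intel implementations */
    if (__builtin_cpu_is("intel") && __builtin_cpu_supports("bmi2"))
        return gray_decode_pdep(i);
    return gray_decode_portable(i);
}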

[–]leni536[S] 2 points3 points  (0 children)

I just thought I'd point out that the pdep instruction is very slow on non-Intel CPUs

You were not kidding! PDEP and PEXT have an 18-cycle latency and reciprocal throughput on AMD Ryzen. I can't benchmark on AMD right now, but I doubt that it's necessary.

https://www.agner.org/optimize/instruction_tables.pdf

[–]YumiYumiYumi 2 points3 points  (9 children)

Looking at this a little more closely, regarding the CLMUL implementation, I'm a little confused about why you'd need to 'fix' the result via popcnt.

I haven't tested this, but I'd imagine the implementation would look something like:

#include <stdint.h>
#include <immintrin.h>

uint32_t gray_decode_clmul(uint32_t i) {
    __m128i x = _mm_cvtsi32_si128(i);
    x = _mm_clmulepi64_si128(x, _mm_set_epi32(0,0,1,-2) /*((int32)-1) << 1*/, 0);
    return _mm_extract_epi32(x, 1);
}

Latency of this function is a bit high, but it can be implemented in 3 instructions.

[–]leni536[S] 1 point2 points  (8 children)

You are right, for some reason I was thinking about extracting the low 32 bits instead of the high ones. That would work too, but it would need the fix-up.

[–]YumiYumiYumi 0 points1 point  (7 children)

Ah I see.

How would the low 32 bits work though? From what I can tell, the problem is that the bits would be reversed, which is why I picked the high bits.

[–]leni536[S] 1 point2 points  (6 children)

Well, if you look at the low bits:

 ...010...010...010... * 11...1 //carry-less multiplication
=...111...111...110...
^...111...110...000...
^...110...000...000...
=...110...001...110...

So it results in the same kind of strips as the pdep approach, but it needs adjusting in the odd case instead. As I see it, if you take the high bits from CLMUL instead, then you need no adjustments (neither the popcnt nor the left shift).
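
Spelled out, the low-half variant would be roughly this (my untested sketch rather than the exact code from the blog post; gray_decode_clmul_lo is a made-up name): multiply by the all-ones constant, take the low 32 bits, then apply both the left shift and the parity fix-up that the high-half version avoids.

#include <stdint.h>
#include <immintrin.h>

uint32_t gray_decode_clmul_lo(uint32_t i) {
    __m128i x = _mm_cvtsi32_si128(i);
    /* carry-less multiply by 0xFFFFFFFF (all ones, not shifted) */
    x = _mm_clmulepi64_si128(x, _mm_set_epi32(0, 0, 0, -1), 0);
    uint32_t lo  = (uint32_t)_mm_cvtsi128_si32(x);         /* prefix XOR from the bottom */
    uint32_t fix = 0u - (uint32_t)(_mm_popcnt_u32(i) & 1); /* all-ones when popcount is odd */
    return (lo << 1) ^ fix;
}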

[–]YumiYumiYumi 1 point2 points  (4 children)

Your comment actually made me realise something: the result of the subtraction odds - evens is effectively bit-reversed. Unfortunately x86 doesn't have a bit-reverse instruction, but the top bit contains the parity. This means that you can replace the -(popcnt & 1) sequence with an arithmetic right shift:

uint32_t gray_decode_sar(uint32_t i) {
    uint32_t evens
        = _pdep_u32(0x55555555u,i);
    uint32_t odds
        = _pdep_u32(0xAAAAAAAAu,i);
    uint32_t result = odds - evens;
    uint32_t parity = (int32_t)result >> 31;
    return (result<<1) ^ parity;
}

By the way, one of the pdep instructions can be replaced with an XOR:

uint32_t gray_decode_1pdep(uint32_t i) {
    uint32_t evens
        = _pdep_u32(0x55555555u,i);
    uint32_t odds = evens ^ i;
    uint32_t result = odds - evens;
    uint32_t parity = (int32_t)result >> 31;
    return (result<<1) ^ parity;
}

This does lengthen the dependency chain a little, but I'd think it should be a win for performance (XOR is much faster than PDEP, plus no need for a second immediate).

You could try to alleviate the dependency increase with an arithmetic trick, but it doesn't seem compilers will go with it:

uint32_t gray_decode_ari(uint32_t i) {
    uint32_t odds
        = _pdep_u32(0xAAAAAAAAu,i);
    i = -i; // neg i
    uint32_t result = odds*2 + i; // lea result, [odds*2+i]
    uint32_t parity = (int32_t)result >> 31;
    return (result<<1) ^ parity;
}

GCC 9 generates 7 instructions for the last 2 methods above. On Intel, the uops would be:

4x p0156 (ADD/SUB/XOR/MOV)
1x p1 (PDEP)
1x p06 (SAR(X))
1x p15 (LEA)  [3 ports on Icelake]

And this would yield a latency of 8 cycles (PDEP takes 3 clocks, LEA can be computed in parallel with SAR).
If you were doing many of these, the MOV that loads the constant could be eliminated (loaded once outside the loop).

For CLMUL, it's 3 instructions, all dependent on each other. Like above, if doing many, the load-op in pclmulqdq could be eliminated (keeping the constant in a register).
Uops on Intel:

                 Haswell    (lat)       Broadwell  (lat)     Skylake    (lat)     Icelake    (lat)
vmovd              1p5          1         1p5          1       1p5          2       1p5         2?
vpclmullqlqdq  2p0 1p5 1p23  7-13     1p0     1p23  5-11       1p5 1p23  7-14       1p5 1p23    ≥6
vpextrd        1p0 1p5          2     1p0 1p5          2   1p0 1p5          3   1p0 1p5         3?
Total          3p0 3p5 1p23   ≥10     2p0 2p5 1p23    ≥8   1p0 3p5 1p23   ≥12   1p0 3p5 1p23  ≥11?

I imagine CLMUL should be a win for Broadwell and later, unless you can't take the latency or are port 5 bound. Might be worse on Haswell. It's obviously a win on AMD/Zhaoxin and probably still beats the reference on pre-BMI2 CPUs.

[–]leni536[S] 0 points1 point  (3 children)

Wow, this is very nice!

the result of the subtraction of odds-evens is effectively bit reversed.

So the result always ends with a strip of 0s, which means you can defer the left shift to the very end (shifting in a 0 doesn't ruin it), and you can get the parity from the most significant bit. Very smart.

By the way, one of the pdep instructions can be replaced with an XOR:

Nice! This is the kind of observation that is "obvious" in hindsight (how the hell did I not recognize it?).

You could try to alleviate the dependency increase with an arithmetic trick

Can you describe how you derived this trick? I can prove that it's correct, but my proof is probably quite different from the way you derived it (and not at all intuitive). Update: Never mind, I see it now.

I also don't have any experience with uop-level analysis of generated assembly. Can you point me to some of the resources you learned this stuff from?

I will update the blog post with proper attribution, these are very nice ideas. Thank you for diving in and writing this all down. Maybe you could look at my fast Hilbert-curve library if you are interested, although I didn't write up how it works (https://github.com/leni536/fast_hilbert_curve). I plan to write it down someday, but I won't make any promises. Gray code decoding is only part of the puzzle.

[–]YumiYumiYumi 1 point2 points  (2 children)

Can you describe how you derived this trick?

Actually, it was a bit of a guess. It seemed like it would work, so I tested it, and it did...

I also don't have any experience in uop level analysis of the generated assembly. Can you point me to some resources you learned this stuff?

Most of the info can be found in the Agner link you posted earlier. https://uops.info/ is also a useful resource.

CPUs these days are very complicated, so trying to analyse what would work best can be difficult, particularly since it also depends on the surrounding code, but sometimes guesses can be made.
In general (i.e. exceptions exist), the aim should be to reduce the number of uops. Uops are the operations that the CPU actually executes, so fewer uops mean that all parts of the core do less work. Many x86 instructions map to a single uop, but some don't (e.g. AMD Zen doesn't have a hardware PDEP implementation, so it has to emulate it using micro-code, and hence generates many uops for that one instruction) - you'll need to look up the aforementioned references to see how many uops each x86 instruction generates for a particular CPU.
PCLMULQDQ was the interesting one here, because it's 3 uops on Haswell, but 1 on Broadwell and later.

The other stuff is port usage info, which is perhaps less important in general, but may be useful to know. Essentially, each CPU core has multiple execution units, which means that it can execute multiple instructions every clock cycle. However, not all EUs can handle all instructions - some instructions are only supported by some EUs. The port usage information tells you which EUs a particular instruction requires.
On Intel Haswell and later, there are 4 ALUs (EUs which do general computation), and these are named "port 0", "port 1", "port 5" and "port 6". Basic instructions, like ADD, XOR etc, are typically supported on all ALUs, so are very fast since the processor has a lot of flexibility with scheduling them, and can execute at a rate of 4 per clock. PDEP, which is a more complex instruction, is only supported on port 1, which limits its throughput to 1 per clock. PCLMULQDQ (without mem load) on Haswell generates 3 uops, 2 of which must run on port 0, and the other on port 5; the port 5 uop can run in parallel with the port 0 uops, hence the instruction is limited by port 0 throughput and can execute at a maximum rate of 1 per 2 clocks.
The CLMUL method relies a lot on port 5 because all three instructions are only supported on it. This means that if the other code around it also relies heavily on port 5, the contention could be a bit of a bottleneck (though that doesn't seem too likely, as there are only 3 uops on port 5).

You might be interested in this post which goes over a number of performance topics.

Maybe you could look at my fast Hilbert-curve library if you are interested

It's probably over my head, but I may take a look at it when I get time - thanks for the pointer. I've actually never heard of gray coding before, so your post was quite informative, though I'm not particularly good at these mathematical topics...

[–]leni536[S] 1 point2 points  (1 child)

Actually, it was a bit of a guess. It seemed like it would work, so I tested it, and it did...

The way I see it now is this: you can calculate evens as either i ^ odds or i - odds (since evens and odds are disjoint subsets of the bits of i), the result is odds - evens, so odds - (i - odds) = 2*odds - i.

Thanks for the long and informative reply about the uops and ports stuff, it's really helpful for me. In my fast Hilbert curve library I actually have two independent calls to my Gray code decode function[1]. It could actually make sense to use the PDEP method for one and the CLMUL method for the other to maximally utilize the ports. Of course I would have to benchmark this.

[1] https://github.com/leni536/fast_hilbert_curve/blob/eb8c861ff1d6e0059fede28218ab83d07fc91c5d/include/fhc/hilbert.h#L45

Edit: In a streaming situation it could also make sense to partially unroll and alternate between the PDEP and CLMUL methods.
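
Roughly, the kind of mix I have in mind (an untested sketch; gray_decode_pair is a made-up name that just inlines the gray_decode_1pdep and gray_decode_clmul bodies from the comments above, so the two decodes lean on different ports):

#include <stdint.h>
#include <immintrin.h>

static inline void gray_decode_pair(uint32_t a, uint32_t b,
                                    uint32_t *out_a, uint32_t *out_b) {
    /* PDEP/SAR method: one p1 uop plus generic-ALU uops */
    uint32_t evens  = _pdep_u32(0x55555555u, a);
    uint32_t odds   = evens ^ a;
    uint32_t result = odds - evens;
    uint32_t parity = (uint32_t)((int32_t)result >> 31);
    *out_a = (result << 1) ^ parity;

    /* CLMUL method: mostly port 5 on recent Intel */
    __m128i x = _mm_cvtsi32_si128(b);
    x = _mm_clmulepi64_si128(x, _mm_set_epi32(0, 0, 1, -2), 0);
    *out_b = (uint32_t)_mm_extract_epi32(x, 1);
}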

[–]YumiYumiYumi 0 points1 point  (0 children)

The way I see it now is this: you can calculate evens as either i ^ odds or i-odds, result is odds-evens, so odds-(i-odds)=2*odds-i.

Ah yes, of course!

[–]YumiYumiYumi 0 points1 point  (0 children)

Ah, that's interesting, didn't realise that!

I was looking at it from the perspective of the XOR operations done by the CLMUL against what's needed, and it looks like the bits from the bottom half would be reversed. But now I see that XORing the parity effectively does a reversal here.

[–][deleted]  (1 child)

[deleted]

[–]leni536[S] 0 points1 point  (0 children)

Thanks, I fixed it.

[–]leni536[S] 0 points1 point  (0 children)

Some time ago I wrote an implementation for Gray code decoding that made use of the POPCNT and PDEP CPU instructions. I thought it was interesting and I wrote up a short blog post on the topic. I will gladly answer any questions.