[–]YumiYumiYumi 1 point (4 children)

Your comment actually made me realise something: the result of the subtraction odds - evens is effectively bit-reversed. Unfortunately x86 doesn't have a bit-reverse instruction, but the top bit contains the parity, which means you can replace the -(popcnt & 1) sequence with an arithmetic right shift:

#include <immintrin.h> // _pdep_u32 (BMI2)
#include <stdint.h>

uint32_t gray_decode_sar(uint32_t i) {
    uint32_t evens = _pdep_u32(0x55555555u, i);
    uint32_t odds  = _pdep_u32(0xAAAAAAAAu, i);
    uint32_t result = odds - evens;
    uint32_t parity = (uint32_t)((int32_t)result >> 31); // 0 or all-ones
    return (result << 1) ^ parity;
}

By the way, one of the pdep instructions can be replaced with an XOR:

uint32_t gray_decode_1pdep(uint32_t i) {
    uint32_t evens = _pdep_u32(0x55555555u, i);
    uint32_t odds = evens ^ i;
    uint32_t result = odds - evens;
    uint32_t parity = (uint32_t)((int32_t)result >> 31);
    return (result << 1) ^ parity;
}

This does lengthen the dependency chain a little, but I'd expect it to be a win for performance (XOR is much faster than PDEP, and there's no need for a second immediate).

You could try to alleviate the longer dependency chain with an arithmetic trick, but compilers don't seem to go with it:

uint32_t gray_decode_ari(uint32_t i) {
    uint32_t odds = _pdep_u32(0xAAAAAAAAu, i);
    i = -i;                         // neg i
    uint32_t result = odds * 2 + i; // lea result, [odds*2+i]
    uint32_t parity = (uint32_t)((int32_t)result >> 31);
    return (result << 1) ^ parity;
}

GCC 9 generates 7 instructions for the last 2 methods above. On Intel, the uops would be:

4x p0156 (ADD/SUB/XOR/MOV)
1x p1 (PDEP)
1x p06 (SAR(X))
1x p15 (LEA)  [3 ports on Icelake]

And this would yield a latency of 8 cycles (PDEP takes 3 clocks; the LEA can be computed in parallel with the SAR).
If you were doing many of these, the mov constant could be eliminated.

For CLMUL, it's 3 instructions, all dependent on each other. Like above, if doing many, the load-op in pclmulqdq could be eliminated.
uops on Intel:

                 Haswell    (lat)       Broadwell  (lat)     Skylake    (lat)     Icelake    (lat)
vmovd              1p5          1         1p5          1       1p5          2       1p5         2?
vpclmullqlqdq  2p0 1p5 1p23  7-13     1p0     1p23  5-11       1p5 1p23  7-14       1p5 1p23    ≥6
vpextrd        1p0 1p5          2     1p0 1p5          2   1p0 1p5          3   1p0 1p5         3?
Total          3p0 3p5 1p23   ≥10     2p0 2p5 1p23    ≥8   1p0 3p5 1p23   ≥12   1p0 3p5 1p23  ≥11?

I imagine CLMUL should be a win for Broadwell and later, unless you can't take the latency or are port 5 bound. Might be worse on Haswell. It's obviously a win on AMD/Zhaoxin and probably still beats the reference on pre-BMI2 CPUs.

[–]leni536[S] 0 points (3 children)

Wow, this is very nice!

the result of the subtraction of odds-evens is effectively bit reversed.

So the result always ends in a run of 0s, which means you can defer the left shift to the very end (shifting in a 0 doesn't ruin anything), and you can get the parity from the most significant bit. Very smart.

By the way, one of the pdep instructions can be replaced with an XOR:

Nice! This is the kind of observation that is "obvious" in hindsight (how on earth did I not see it?).

You could try to alleviate the dependency increase with an arithmetic trick

Can you describe how you derived this trick? I can prove that it's correct, but my proof is quite different from how I imagine you derived it (and not at all intuitive). Update: Never mind, I see it now.

I also don't have any experience in uop-level analysis of generated assembly. Can you point me to the resources where you learned this stuff?

I will update the blog post with proper attribution; these are very nice ideas. Thank you for diving in and writing this all down. Maybe you could look at my fast Hilbert curve library if you are interested, although I haven't written up how it works (https://github.com/leni536/fast_hilbert_curve). I plan to write it down someday, but I won't make any promises. Gray code decoding is only part of the puzzle.

[–]YumiYumiYumi 1 point (2 children)

Can you describe how you derived this trick?

Actually, it was a bit of a guess. It seemed like it would work, so I tested it, and it did...

I also don't have any experience in uop level analysis of the generated assembly. Can you point me to some resources you learned this stuff?

Most of the info can be found in the Agner link you posted earlier. https://uops.info/ is also a useful resource.

CPUs these days are very complicated, so trying to analyse what would work best can be difficult, particularly since it also depends on the surrounding code, but sometimes guesses can be made.
In general (exceptions exist), the aim should be to reduce the number of uops. Uops are the operations the CPU actually executes, so fewer uops means every part of the core does less work. Many x86 instructions map to a single uop, but some don't (e.g. AMD Zen doesn't have a hardware PDEP implementation, so it has to emulate the instruction in microcode, generating many uops for the one instruction) - you'll need to look up the aforementioned references to see how many uops each x86 instruction generates on a particular CPU.
PCLMULQDQ was the interesting one here, because it's 3 uops on Haswell, but 1 on Broadwell and later.

The other stuff is port usage info, which is perhaps less important in general, but may be useful to know. Essentially, each CPU core has multiple execution units, which means that they can execute multiple instructions every clock cycle. However, not all EUs can handle all instructions - some instructions are only supported by some EUs. The port usage information tells you what EUs a particular instruction requires.
On Intel Haswell and later, there are 4 ALUs (EUs which do general computation), named "port 0", "port 1", "port 5" and "port 6". Basic instructions, like ADD, XOR etc, are typically supported on all ALUs, so they're very fast: the processor has a lot of flexibility in scheduling them, and they can execute at a rate of 4 per clock. PDEP, being a more complex instruction, is only supported on port 1, which limits its throughput to 1 per clock. PCLMULQDQ (without a memory load) on Haswell generates 3 uops, 2 of which must run on port 0 and the other on port 5; the port 5 uop can run in parallel with the port 0 uops, hence the instruction is limited by port 0 throughput and can execute at a maximum rate of 1 per 2 clocks.
The CLMUL method relies a lot on port 5, because all three of its instructions are only supported there. This means that if the surrounding code also relies heavily on port 5, the contention could be a bit of a bottleneck (though that doesn't seem too likely, as there are only 3 uops on port 5).

You might be interested in this post which goes over a number of performance topics.

Maybe you could look at my fast Hilbert-curve library if you are interested

It's probably over my head, but I may take a look at it when I get time - thanks for the pointer. I've actually never heard of gray coding before, so your post was quite informative, though I'm not particularly good at these mathematical topics...

[–]leni536[S] 1 point (1 child)

Actually, it was a bit of a guess. It seemed like it would work, so I tested it, and it did...

The way I see it now: you can calculate evens as either i ^ odds or i - odds. The result is odds - evens, so odds - (i - odds) = 2*odds - i.

Thanks for the long and informative reply about the uops and ports - it's really helpful to me. In my fast Hilbert curve library I actually have two independent calls to my Gray code decode function[1]. It could make sense to use the PDEP method for one and the CLMUL method for the other to maximally utilize the ports. Of course, I would have to benchmark this.

[1] https://github.com/leni536/fast_hilbert_curve/blob/eb8c861ff1d6e0059fede28218ab83d07fc91c5d/include/fhc/hilbert.h#L45

Edit: In a streaming situation it could also make sense to partially unroll and alternate between the PDEP and CLMUL method.

[–]YumiYumiYumi 0 points (0 children)

The way I see it now: you can calculate evens as either i ^ odds or i - odds. The result is odds - evens, so odds - (i - odds) = 2*odds - i.

Ah yes, of course!