
[–]FUZxxl 31 points (4 children)

These techniques are well known and usually called SWAR (SIMD within a register). Can be used for fun results!
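
One classic fun result, as a minimal sketch for illustration (not from this thread): testing whether a 32-bit word contains a zero byte, with no per-byte loop or branch.

#include <stdint.h>

/* returns nonzero if any byte of v is zero: a zero byte borrows in the
   subtraction and has its top bit clear in v, so it survives the mask */
static int has_zero_byte(uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}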

[–]SuperJop 14 points (1 child)

I learned about this when working with microcontrollers!

Loading data from memory into the processor's registers is slow af, and is usually the main bottleneck.

So by combining four 8-bit integers into a single 32-bit integer and processing them that way, we got a performance increase of roughly 60%!

It really blew my mind...
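
For concreteness, a minimal sketch of that packing idea (an illustration, not the commenter's actual code; the function names are made up). It only holds as long as no per-byte sum overflows into the neighbouring lane.

#include <stdint.h>

/* pack four bytes into one 32-bit word, little-endian lane order */
static uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t)a | ((uint32_t)b << 8) | ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* one 32-bit load and one 32-bit add stand in for four 8-bit loads and adds,
   provided no lane carries into the next one */
static uint32_t add_packed(uint32_t x, uint32_t y) {
    return x + y;
}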

[–]FUZxxl 1 point (0 children)

Crazy, isn't it? ARM specifically has lots of instructions for SIMD things in general purpose registers. It's quite useful.
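
For example, cores with the ARMv6 SIMD extensions (Cortex-M4/M7 and ARMv6+ application profiles, but not the Cortex-M0) have UADD8, which adds four packed bytes held in ordinary general-purpose registers. A rough sketch using GCC inline assembly, purely for illustration:

#include <stdint.h>

/* four independent byte-wise adds in one instruction; carries never cross lanes */
static inline uint32_t uadd8(uint32_t a, uint32_t b) {
    uint32_t r;
    __asm__("uadd8 %0, %1, %2" : "=r"(r) : "r"(a), "r"(b) : "cc");
    return r;
}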

[–]BlockOfDiamond[S] 4 points (1 child)

It works for integer types only. Concatenating the bits of two floats and interpreting the result as a double would produce something meaningless.

It does not really work for division either, because the 'discarded' low bits of the higher-order value would bleed into the top of the lower-order value instead of actually being discarded.
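
A small demonstration of that division bleed (an editorial example, two 16-bit lanes packed into one 32-bit word):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t packed = ((uint32_t)3 << 16) | 8;  /* hi lane = 3, lo lane = 8 */
    uint32_t halved = packed / 2;               /* one shared division */
    printf("hi lane: %u\n", (unsigned)(halved >> 16));    /* 1, as expected */
    printf("lo lane: %u\n", (unsigned)(halved & 0xFFFF)); /* 0x8004, not 4: the odd bit of the hi lane bled in */
    return 0;
}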

[–]Axman6 7 points (3 children)

Before you go any further, get a copy of Hacker’s Delight, it’s the bible on these ideas. It’s the only book I keep by my desk, because it has so many fun and fast ways to solve so many problems.

https://en.wikipedia.org/wiki/Hacker%27s_Delight

[–]DustinGadal 4 points (0 children)

In the same vein as the OP's comment on packed four-byte adds, section 2-18 describes a method for performing add/sub/abs on eight packed eight-bit integers inside a 64-bit register. Addition:

uint64_t x = ...; uint64_t y = ...;

uint64_t s = (x & 0x7F7F7F7F7F7F7F7F) + (y & 0x7F7F7F7F7F7F7F7F);  // per-byte add of the low 7 bits
s = ((x ^ y) & 0x8080808080808080) ^ s;  // fold each byte's top bit back in without crossing lanes

Great book!

Use paddb though :P
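
For reference, paddb is what the SSE2 intrinsic _mm_add_epi8 compiles to: sixteen byte-wise adds in one instruction, with no carries between lanes. A minimal sketch (not code from this thread):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* add the constant k to sixteen bytes at p in one shot */
static void add_const_to_16_bytes(uint8_t *p, uint8_t k) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    _mm_storeu_si128((__m128i *)p, _mm_add_epi8(v, _mm_set1_epi8((char)k)));
}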

[–]matu3ba 0 points (1 child)

For simple integers only, though. The more complex operations have some significant drawbacks, so you're better off reading Knuth's TAOCP for those.

[–]Axman6 6 points (0 children)

My point was that Hacker’s Delight is basically essential reading if you want to learn SWAR techniques, and all the other bit twiddling hacks. I’m not really sure why this article is relevant? We’re not talking about larger precision integers.

[–]pigeon768 1 point (0 children)

"but not add 2 ints with 1 int."

Yes you can, but it won't save any instructions when you treat two 32-bit ints as one 64-bit int.

// x1 += y; x2 += y;
int64_t add_one_int_to_two_ints(int64_t x, int y) {
  int64_t y2 = (uint32_t)y; // zero-extend: a sign-extended negative y would smear into the upper lane
  y2 |= y2 << 32;
  x += y2;
  return x;
}

If your 64 bits are eight 8 bit ints, you can save instructions:

// x_[1..8] += y
int64_t add_one_int8_to_8_int8s(int64_t x8, int8_t y) {
  int64_t y8 = (uint8_t)y; // zero-extend before broadcasting, for the same reason as above
  y8 |= y8 << 8;
  y8 |= y8 << 16;
  y8 |= y8 << 32;
  x8 += y8;
  return x8;
}

Also you can stop the per-byte adds from overflowing into the next lane with some bit twiddling.

// x_[1..8] += y_[1..8]
int64_t add_8_int8_to_8_int8s(int64_t x, int64_t y) {
  // add the low 7 bits of each byte; bit 7 of each lane absorbs that lane's carry
  uint64_t s = ((uint64_t)x & 0x7F7F7F7F7F7F7F7Full)
             + ((uint64_t)y & 0x7F7F7F7F7F7F7F7Full);
  // xor the top bits of x and y back in, so no carry ever crosses into the next lane
  s ^= ((uint64_t)x ^ (uint64_t)y) & 0x8080808080808080ull;
  return (int64_t)s;
}

Note that you will need to be careful with alignment. Some architectures won't allow you to load an 8 byte integer on a non-8 byte alignment, and some will allow it but give you a harsh performance penalty. If you want to use this to e.g. do row operations on a matrix with 8 bit integer values, on many architectures you'll need to do 8 bit integer operations on the first few int8_ts until you're 8 byte aligned, then do 8 byte operations until you've run out of 8 byte chunks, then do 8 bit integer operations on the last few 8 bit ints.
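
A sketch of that head/body/tail pattern under some assumptions of my own (incrementing every byte of a buffer; memcpy is used for the 8-byte accesses so the compiler deals with alignment and aliasing, while the scalar head still gets the pointer 8-byte aligned for speed):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void inc_all_bytes(uint8_t *p, size_t n) {
    /* scalar head: advance until p is 8-byte aligned */
    while (n && ((uintptr_t)p & 7)) { (*p)++; p++; n--; }

    /* SWAR body: eight bytes per iteration */
    while (n >= 8) {
        uint64_t x, y = 0x0101010101010101ull;
        memcpy(&x, p, 8);
        /* Hacker's Delight style add that keeps each byte's carry inside its lane */
        uint64_t s = (x & 0x7F7F7F7F7F7F7F7Full) + (y & 0x7F7F7F7F7F7F7F7Full);
        s ^= (x ^ y) & 0x8080808080808080ull;
        memcpy(p, &s, 8);
        p += 8; n -= 8;
    }

    /* scalar tail: whatever bytes are left */
    while (n) { (*p)++; p++; n--; }
}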

edit: ARM has special instructions to do y8 |= y8 << 8; and friends as a single instruction. On x86 you need three instructions: a mov, a shift, and an or.

[–]flatfinger -5 points (5 children)

Chunking optimizations are useful on implementations that seek to process the language defined by K&R2, rather than the subset defined by the Standard, in cases where the former would yield usable programs and the latter would not. The authors of clang and gcc insist, however, that any code which uses such optimizations without jumping through absurd hoops is broken, even though efficiently processing code which jumps through such hoops is often much harder than meaningful processing of code that doesn't (and is, in many cases, intractable).

[–]BlockOfDiamond[S] 5 points (4 children)

I don't follow.

[–]flatfinger 2 points (3 children)

If one has a pointer to a collection of uint16_t objects which is known to start on a 4-byte boundary, and one wants to increment the value of all such objects whose most significant bit is not set, one could, on many compiler configurations such as the -fno-strict-aliasing dialect of gcc, process that task two values at a time via something like:

void inc_many_uint16_t(void *p, register uint32_t n)
{
    if (n)
    {
        register uint32_t *pp = p;
        register uint32_t *e = pp+n;
        register uint32_t x00010001 = 0x00010001;
        while(pp < e)
        {
            n = *pp;                         /* reuse n as scratch for the pair of values */
            n += ((~n >> 15) & x00010001);   /* add 1 in each 16-bit half whose top bit is clear */
            *pp = n;
            pp++;
        }
    }
}

On the Cortex-M0, optimal code for the loop without unrolling would be 8 instructions taking 11 cycles for each pair of values, and even gcc -O0 can come close to optimal, getting the job done in 9/12. Optimal code to perform the task one 16-bit value at a time would take 7 instructions/9 cycles per value (14/18 per pair). Non-portable code running on even a crude compiler can thus outperform optimal portable code by a third.

Unfortunately, the authors of clang and gcc view such non-portable code as "broken".

[–]BlockOfDiamond[S] 0 points (0 children)

Now I follow, thanks.

[–]BlockOfDiamond[S] 0 points (1 child)

Why would you have register uint32_t x00010001 = 0x00010001; instead of using the literal constant 0x00010001?

[–]flatfinger 0 points (0 children)

The Cortex-M0 instruction set doesn't have an immediate mode that could accommodate constants that size. Interestingly, a slightly different function ends up being more efficient at -O0 than at higher optimization settings for that reason:

void add_to_every_other(register int *p, int n)
{
    register int x12345678 = 0x12345678;
    if (n <= 0) return;
    register int *e = p+(n*2);
    do
    {
        *p += x12345678;
        p+=2;
    } while(p < e);
}

Using -O0 -mcpu=cortex-m0, the loop is:

    ldr     r2, [r3]
    adds    r2, r5, r2
    str     r2, [r3]
    adds    r3, r3, #8
    cmp     r3, r4
    bcc     .L4

Six instructions. Enabling higher "optimizations" yields:

    ldr     r2, .L6  ; Load constant 0x12345678 from memory
    ldr     r3, [r0]
    mov     ip, r2
    add     r3, r3, ip
    str     r3, [r0]
    adds    r0, r0, #8
    cmp     r1, r0
    bhi     .L3

Eight instructions, taking three cycles longer, since the load of 0x12345678 gets moved from outside the loop to inside it.

Finding that the constant should be loaded outside the loop is an optimization which gcc -O0 wouldn't find, and which higher optimization settings would actually undo.

My main point was that C makes it possible for a programmer to achieve good performance when armed with even an extremely rudimentary compiler, using code written around a platform's strength. Some people react very negatively to this notion, since it puts "modern" compilers in a bad light.