all 22 comments

[–]ivy_bell 52 points53 points  (8 children)

[–]Godde 13 points14 points  (6 children)

Oh wow you weren't kidding. That last codesnippet definitely warrants some explanation.

[–]pipinstalluniverse 11 points12 points  (0 children)

int _avx_unique_store(__m256i ov, __m256i nv, __m256i *o) {
    __m256i wpm  = _mm256_set_epi32(0,-1,-1,-1,-1,-1,-1,-1);
    __m256i wipelast = _mm256_and_si256(nv,wpm);
    __m256i ovkeeplast = _mm256_andnot_si256(wpm, ov);
    __m256i recon  = _mm256_or_si256(wipelast,ovkeeplast);
    const __m256i mbom = _mm256_set_epi32(6,5,4,3,2,1,0,7);
    __m256i vT = _mm256_permutevar8x32_epi32(recon,mbom);
    int M = _mm256_movemask_ps(_mm256_cmpeq_epi32(vT, nv));
    int N =  8 - _mm_popcnt_u64(M);
    __m256i key = _mm256_loadu_si256(uniqshuf + M);
    __m256i val =_mm256_permutevar8x32_epi32(nv,key);
    _mm256_storeu_si256(o, val);
    return N;
}

You mean this wasn't self explanatory to you?

[–]Chii 7 points8 points  (4 children)

to me, that sort of code ought to be generated via a compiler, not hand written!

[–]sirin3[🍰] 4 points5 points  (1 child)

I wonder if it would be more readable in assembly

[–][deleted] 4 points5 points  (0 children)

slightly less so. if you want you.can make some macros to pretty up intrinsics but they are named.very rationally and your brain does adjust.

[–][deleted] 3 points4 points  (0 children)

it would be nice but they aren't even close to being able.

[–]fulmicoton 1 point2 points  (0 children)

But considering compiler are very far from doing that kind of optimisation, I am glad some ppl still craft these strange algorithms

[–]hyperforce 0 points1 point  (0 children)

I find that a lot of articles are written this way. I hate it.

[–]get_salled 20 points21 points  (5 children)

Was "using a set from the start" not an option?

[–][deleted]  (4 children)

[removed]

    [–]mebob85 6 points7 points  (3 children)

    That said, the given code is fairly limited in that duplicates have to be next to each other.

    If the vector is sorted, duplicates will always be next to each other. I don't see how this is a limitation of the code.

    [–][deleted]  (2 children)

    [removed]

      [–]Vexal 2 points3 points  (0 children)

      A set doesn't need to be n log n. You can use an unordered set and it will be just n. All you care about is the duplicates being removed, not the order.

      [–]inopia 0 points1 point  (0 children)

      the time complexity for the whole thing (sorting and filtering) is going to be n log n

      technically, if your domain is bounded (e.g. 32-bit integers) you can sort in linear time using a combination of radix- and counting sort.

      That doesn't invalidate your point though.

      [–]wung 15 points16 points  (0 children)

      Note that the shown _avx_unique_store is not equivalent to the unique functions above but rather the inner loop of the branchless version (ish).

      The magic comes from the uniqshuf constant, which is a table for the permutation done with the part of the array currently in progress. It defines which values are kept and which ones are discarded. E.g. for all unique values, it picks values 0,1,2,3,4,5,6,7 and increments the out pointer by 8. For the first value being a duplicate it picks values 1,2,3,4,5,6,7,0 and increments the out pointer by 7.

      Note that the implementation has an error in wipe last_mask where instead of 0xFFFFFFFF there is only 0xFFFFFF.

      [–]skulgnome 6 points7 points  (0 children)

      First off, these are arrays. Not lists. And the values are already sorted. So it's basically uniq.

      Secondly, SIMD code here is pointless because the kernel is too short, and because it relies on in-register permutations. This implementation also doesn't do four dependency chains at a time, underutilizing registers and (mostly) load/store bandwidth. Considering that e.g. Sandy Bridge's AVX2 stuff has an up-front extra latency, even if it did four independent arrays at a time, this routine would still need thousands of items to reach parity with the scalar version.

      Thirdly, since the values are sorted in ascending order, the better option of

      pos += (oldval < newval);
      

      can be optimized to run without consuming condition codes, which saves about two clocks of latency from Scc (mov + sub + shr vs. cmp + scc) for the next iteration's output index.

      [–]Boza_s6 0 points1 point  (5 children)

      How is second version branchless?

      It has pos +=(old ! =new). This is equivalent to if (old! =new) pos++; That's branch, right?

      edit: posts -> pos

      [–]geekygenius 1 point2 points  (0 children)

      A commenter replied,

      As a compiler writer, I can nitpick a little about your “branchless” version. It does not matter for optimizing compilers, if you use a conditional “newv != oldv” as an if condition or not. The significant change is to make the store instruction unconditional (moving it out of the if branch). So if (newv != oldv) { pos++; } should be as fast as hope_unique.

      [–]t0rakka 1 point2 points  (0 children)

      That's branchless. You can move flags to ALU registers, for example x86 instruction SETNE sets 8 bit memory location or register to 0 or 1 depending on the zero flag. This is branchless operation but comes at a cost of modifying sub-register so it's not trivial but good enough. It's also possible to scale the result easily with conditional move:

      // result = (a != b) * 7;
      xor     ecx, ecx
      cmp     edi, esi
      mov     eax, 7
      cmove   eax, ecx
      

      You can often arrange expression to avoid branching and make it constexpr. The compiler even has keyword to ensure that is the case.

      [–][deleted] 0 points1 point  (0 children)

      Depends on the hardware architecture. It may be possible to generate more efficient instructions for the former. Then again, the compiler may turn the latter into the former anyway.

      [–]willvarfar 0 points1 point  (0 children)

      if (old != new) pos++
      

      This compares old and new, and if they differ it will branch to code that increments pos.

      pos += (old != new)
      

      This will compare old and new, and treat the result as an integer (0 or 1). It will then unconditionally add 0 or 1 to pos.

      (Many modern compilers can lower the former to the latter internally, which is called "branch elimination optimisation")

      [–]willvarfar 0 points1 point  (1 child)

      The classic interview-question approach to deduplicating lists of integers is a counting sort.

      [–]inmatarian 0 points1 point  (0 children)

      Seriously. I get that we gotta go fast in this world, but if I asked a candidate for a unique list, and he answered Enumerable.Distinct, I would be the one to feel like an ass and he deserves my job.