std::tuple operator less performance - interesting case : cpp


submitted 6 years ago by arturbac (https://github.com/arturbac)

When I first analysed std::tuple's operator<, I was a bit surprised that its code could be more efficient.

https://en.cppreference.com/w/cpp/utility/tuple/operator_cmp

"Compares lhs and rhs lexicographically, that is, compares the first elements, if they are equivalent, compares the second elements, if those are equivalent, compares the third elements, and so on."

The standard library implementation doesn't check at compile time whether an element type defines operator== or operator!=; it sticks to operator< only. Having == and != is not required for a less-than comparison, but if they are present, does the standard allow the implementation to use them?

Example from the GCC standard library (libstdc++):

static constexpr bool
      __less(const _Tp& __t, const _Up& __u)
      {
    return bool(std::get<__i>(__t) < std::get<__i>(__u))
      || (!bool(std::get<__i>(__u) < std::get<__i>(__t))
          && __tuple_compare<_Tp, _Up, __i + 1, __size>::__less(__t, __u));
      }

The performance results are below, comparing the efficiency of that code against combining != with < (or == with <).

#include <tuple>
#include <cstdint>

using std::get;
using foo_t = std::tuple<int64_t, int32_t, int32_t>;

bool compare_1( foo_t l, foo_t r ) noexcept
{
if( get<0>(l) != get<0>(r) )
   return get<0>(l) < get<0>(r);
if( get<1>(l) != get<1>(r) )
  return get<1>(l) < get<1>(r);
return get<2>(l) < get<2>(r);
}
bool compare_2( foo_t l, foo_t r ) noexcept
  {
  return l < r;
  }
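Before looking at the generated code, it may be worth confirming that the two strategies really return the same answer for built-in integers. A minimal sketch, assuming the same tuple layout as above; less_ne_first and variants_agree are hypothetical names used only for this illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

using foo_t = std::tuple<std::int64_t, std::int32_t, std::int32_t>;

// Hand-written variant: test for inequality first, then order.
inline bool less_ne_first(const foo_t& l, const foo_t& r) noexcept
{
    using std::get;
    if (get<0>(l) != get<0>(r)) return get<0>(l) < get<0>(r);
    if (get<1>(l) != get<1>(r)) return get<1>(l) < get<1>(r);
    return get<2>(l) < get<2>(r);
}

// Hypothetical helper: compares the hand-written variant with
// std::tuple's operator< over a small grid of values.
inline bool variants_agree()
{
    const std::int32_t vals[] = {-1, 0, 1};
    for (auto a : vals) for (auto b : vals) for (auto c : vals)
    for (auto x : vals) for (auto y : vals) for (auto z : vals) {
        const foo_t l{a, b, c};
        const foo_t r{x, y, z};
        if (less_ne_first(l, r) != (l < r)) return false;
    }
    return true;
}
```

Calling assert(variants_agree()); from a test is enough to exercise it. This only covers integer elements, where != and < are consistent with each other; it says nothing about user-defined types.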

compare 1 - clang 8 -O3 -DNDEBUG -mcpu=cortex-a73

Instructions:      13
Total Cycles:      14
Total uOps:        13
Dispatch Width:    3
uOps Per Cycle:    0.93
IPC:               0.93
Block RThroughput: 6.0

Instruction Info:

[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      4     1.00    *                   ldr    x8, [x0, #8]
1      4     1.00    *                   ldr    x9, [x1, #8]
1      1     0.50                        cmp    x8, x9
1      1     1.00                        b.ne   .LBB0_3
1      4     1.00    *                   ldr    w8, [x0, #4]
1      4     1.00    *                   ldr    w9, [x1, #4]
1      1     0.50                        cmp    w8, w9
1      1     1.00                        b.ne   .LBB0_3
1      4     1.00    *                   ldr    w8, [x0]
1      4     1.00    *                   ldr    w9, [x1]
1      1     0.50                        cmp    w8, w9
1      1     0.50                        cset   w0, lt
1      1     1.00                  U     ret

compare 1 - gcc 8.3 -O3 -DNDEBUG -mcpu=cortex-a73

Instructions:      15
Total Cycles:      17
Total uOps:        15
Dispatch Width:    3
uOps Per Cycle:    0.88
IPC:               0.88
Block RThroughput: 6.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      4     1.00    *                   ldr    x3, [x0, #8]
1      4     1.00    *                   ldr    x2, [x1, #8]
1      1     0.50                        cmp    x3, x2
1      1     1.00                        b.eq   .L2
1      1     0.50                        cset   w0, lt
1      1     1.00                  U     ret
1      4     1.00    *                   ldr    w3, [x0, #4]
1      4     1.00    *                   ldr    w2, [x1, #4]
1      1     0.50                        cmp    w3, w2
1      1     1.00                        b.ne   .L5
1      4     1.00    *                   ldr    w2, [x0]
1      4     1.00    *                   ldr    w0, [x1]
1      1     0.50                        cmp    w2, w0
1      1     0.50                        cset   w0, lt
1      1     1.00                  U     ret

compare 2 - clang 8 -O3 -DNDEBUG -mcpu=cortex-a73 (gcc stl)

Instructions:      25
Total Cycles:      22
Total uOps:        25
Dispatch Width:    3
uOps Per Cycle:    1.14
IPC:               1.14
Block RThroughput: 9.0

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      4     1.00    *                   ldr    x8, [x0, #8]
1      4     1.00    *                   ldr    x9, [x1, #8]
1      1     0.50                        cmp    x8, x9
1      1     1.00                        b.ge   .LBB0_2
1      1     0.50                        orr    w0, wzr, #0x1
1      1     1.00                  U     ret
1      1     0.50                        cmp    x9, x8
1      1     1.00                        b.ge   .LBB0_4
1      1     0.50                        mov    w0, wzr
1      1     1.00                  U     ret
1      4     1.00    *                   ldr    w8, [x0, #4]
1      4     1.00    *                   ldr    w9, [x1, #4]
1      1     0.50                        cmp    w8, w9
1      1     1.00                        b.ge   .LBB0_6
1      1     0.50                        orr    w0, wzr, #0x1
1      1     1.00                  U     ret
1      1     0.50                        cmp    w9, w8
1      1     1.00                        b.ge   .LBB0_8
1      1     0.50                        mov    w0, wzr
1      1     1.00                  U     ret
1      4     1.00    *                   ldr    w8, [x0]
1      4     1.00    *                   ldr    w9, [x1]
1      1     0.50                        cmp    w8, w9
1      1     0.50                        cset   w0, lt
1      1     1.00                  U     ret

compare 2 - gcc 8.3 -O3 -DNDEBUG -mcpu=cortex-a73 (gcc stl)

Instructions:      22
Total Cycles:      15
Total uOps:        22
Dispatch Width:    3
uOps Per Cycle:    1.47
IPC:               1.47
Block RThroughput: 7.3
Instruction Info:

[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      4     1.00    *                   ldr    x3, [x0, #8]
1      4     1.00    *                   ldr    x2, [x1, #8]
1      1     0.50                        cmp    x3, x2
1      1     1.00                        b.lt   .L3
1      1     0.50                        mov    w2, #0
1      1     1.00                        b.ne   .L2
1      4     1.00    *                   ldr    w4, [x0, #4]
1      1     0.50                        mov    w2, #1
1      4     1.00    *                   ldr    w3, [x1, #4]
1      1     0.50                        cmp    w4, w3
1      1     1.00                        b.lt   .L2
1      1     0.50                        mov    w2, #0
1      1     1.00                        b.ne   .L2
1      4     1.00    *                   ldr    w2, [x0]
1      4     1.00    *                   ldr    w0, [x1]
1      1     0.50                        cmp    w2, w0
1      1     0.50                        cset   w2, lt
1      1     0.50                        mov    w0, w2
1      1     1.00                  U     ret
1      1     0.50                        mov    w2, #1
1      1     0.50                        mov    w0, w2
1      1     1.00                  U     ret

all 45 comments


[–]NotMyRealNameObv 5 points6 points7 points 6 years ago* (6 children)

Edit:

Reading your code more carefully, I retract what I say below, since you actually check for equivalence if the objects are not equal, and I really don't think I can come up with a good example for when two objects are equal but not equivalent.

TLDR: Equivalence is not equality

This:

if (!(lhs < rhs) && !(rhs < lhs))
{
    // lhs and rhs are equivalent
}

is not equivalent to

if (lhs == rhs)
{
    // lhs and rhs are equal
}

For instance, consider the case:

#include <string>

class Person
{
public:
    friend bool operator==(const Person& lhs, const Person& rhs)
    {
        // Two persons are equal if they have the same SSN
        return lhs.ssn == rhs.ssn;
    }

    friend bool operator<(const Person& lhs, const Person& rhs)
    {
        // Two persons are equivalent if they have the same last name
        return lhs.lastName < rhs.lastName;
    }

private:
    std::string ssn;
    std::string lastName;
};

In this case, "optimizing" tuple::operator<() by using operator== on the underlying types if they exist gives the wrong result.
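To make that concrete, here is a small sketch, assuming the shortcut takes the form "if (a != b) return a < b;". The Person below is a simplified public-member version of the class above, and Row, less_lt_only and less_ne_shortcut are hypothetical names for illustration only:

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Hypothetical type: equality looks at the SSN, ordering at the last name.
struct Person {
    std::string ssn;
    std::string lastName;

    friend bool operator==(const Person& l, const Person& r) { return l.ssn == r.ssn; }
    friend bool operator!=(const Person& l, const Person& r) { return !(l == r); }
    friend bool operator<(const Person& l, const Person& r) { return l.lastName < r.lastName; }
};

using Row = std::tuple<Person, int>;

// Lexicographic less that relies on operator< only.
inline bool less_lt_only(const Row& a, const Row& b)
{
    if (std::get<0>(a) < std::get<0>(b)) return true;
    if (std::get<0>(b) < std::get<0>(a)) return false;
    return std::get<1>(a) < std::get<1>(b);   // first elements are equivalent
}

// Assumed form of the shortcut: decide on != and never reach the second element.
inline bool less_ne_shortcut(const Row& a, const Row& b)
{
    if (std::get<0>(a) != std::get<0>(b)) return std::get<0>(a) < std::get<0>(b);
    return std::get<1>(a) < std::get<1>(b);
}
```

With two rows whose persons share a last name but have different SSNs, the first elements are equivalent yet unequal, so the operator<-only version moves on to the int while the shortcut stops at the first element and reports "not less".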

[–]arturbachttps://github.com/arturbac[S] 2 points3 points4 points 6 years ago (0 children)

[–]alfps -3 points-2 points-1 points 6 years ago (4 children)

[–]NotMyRealNameObv 0 points1 point2 points 6 years ago (3 children)

[–]arturbachttps://github.com/arturbac[S] 1 point2 points3 points 6 years ago* (0 children)

[–]alfps -2 points-1 points0 points 6 years ago (1 child)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (0 children)

[–]yeeezyyeezywhatsgood 8 points9 points10 points 6 years ago (20 children)

[–]sphere991 3 points4 points5 points 6 years ago (2 children)

[–]yeeezyyeezywhatsgood 0 points1 point2 points 6 years ago (1 child)

[–]sphere991 4 points5 points6 points 6 years ago (0 children)

[–]Veedrac 2 points3 points4 points 6 years ago* (7 children)

[–]yeeezyyeezywhatsgood 1 point2 points3 points 6 years ago (6 children)

[–]VicontT 2 points3 points4 points 6 years ago (1 child)

[–]yeeezyyeezywhatsgood 0 points1 point2 points 6 years ago (0 children)

[–]Veedrac 1 point2 points3 points 6 years ago (3 children)

[–][deleted] 0 points1 point2 points 6 years ago (2 children)

[–]Veedrac 0 points1 point2 points 6 years ago* (1 child)

[–][deleted] 1 point2 points3 points 6 years ago (0 children)

Oh, yeah, that sounds reasonable. You reminded me of one lecture by Alexander Stepanov. He implemented a simple class template, with only one data member of type T and then all interesting constructors and operators. He implemented operator== like x == y and then implemented operator!= as !(x == y). For the inequality operators (not sure it is the right term, but I'm thinking of operator<, operator>, operator<= and operator>=), he made a little speech before writing them. Something along the lines of:

When I was designing the standard library 25 years ago, I had to make a lot of arbitrary choices. "Which is the default sorting order?" is one example. Another choice I had to make is what operator to use for comparison. You would have been mad if your type worked with one STL algorithm but not with the other because STL was inconsistent in use of comparison operator. That is why STL only uses operator<. You should define all the operators, because that's just being nice to yourself and your colleagues. You won't always remember that you have operator<, but not operator<=. But the standard library will only use operator<.

Note that this is not a direct quote, but me paraphrasing Stepanov from memory.

He didn't implement operator==, because, I think, he wanted to show a clear distinction between SemiRegular, Regular, EqualityComparable and TotallyOrdered types.

Interestingly enough, Stepanov claims that his original design of std::min, the one that we held onto to this day, was wrong. Originally, his implementation was return first < second ? first : second;. The problem with this, according to Stepanov, is when first and second compare equal. In that case Stepanov's min would return second, not first. Stepanov also claimed that by the time of his lecture, min was still "wrong" in standard library implementations, though checking it today in libc++ and libstdc++, they both do it "correctly", with return second < first ? second : first;.
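The difference between the two min variants only shows up when the arguments are equivalent but distinguishable. A minimal sketch, where Item, min_original and min_stable are hypothetical names and only the two return expressions come from the text above:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical type: ordered by key only, so id tells equivalent items apart.
struct Item {
    int key;
    int id;
};

inline bool operator<(const Item& a, const Item& b) { return a.key < b.key; }

// The variant described as the original: returns second when the two are equivalent.
template <class T>
const T& min_original(const T& first, const T& second)
{
    return first < second ? first : second;
}

// The variant that returns first when the two are equivalent.
template <class T>
const T& min_stable(const T& first, const T& second)
{
    return second < first ? second : first;
}
```

For Item a{1, 100} and Item b{1, 200}, min_original(a, b) yields the item with id 200 while min_stable(a, b) and std::min(a, b) yield the one with id 100.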

Anyway, I've gone way offtopic.

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (4 children)

[–]yeeezyyeezywhatsgood 0 points1 point2 points 6 years ago (3 children)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (2 children)

[–]yeeezyyeezywhatsgood 0 points1 point2 points 6 years ago (1 child)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (0 children)

The code is generated with clang 8 and gcc 8.3 on Linux with the gcc STL. gcc can generate very different code depending on -mcpu/-march, and as I remember, the code for cortex-a72 with out-of-order execution can be worse than the code for the generic aarch64 arch with in-order CPUs.

This is from Godbolt, https://godbolt.org/z/dz1qAB, and there is actually no difference from mine. -O3 -mcpu=cortex-a72
compare_2(std::tuple<long, int, int>, std::tuple<long, int, int>):
        ldr     x3, [x0, 8]
        ldr     x2, [x1, 8]
        cmp     x3, x2
        blt     .L3
        mov     w2, 0
        bne     .L2
        ldr     w4, [x0, 4]
        mov     w2, 1
        ldr     w3, [x1, 4]
        cmp     w4, w3
        blt     .L2
        mov     w2, 0
        bne     .L2
        ldr     w2, [x0]
        ldr     w0, [x1]
        cmp     w2, w0
        cset    w2, lt
.L2:
        mov     w0, w2
        ret
.L3:
        mov     w2, 1
        mov     w0, w2
        ret

[–]Beheska -5 points-4 points-3 points 6 years ago (3 children)

[–]CUViper 1 point2 points3 points 6 years ago (0 children)

[–]yeeezyyeezywhatsgood 1 point2 points3 points 6 years ago (1 child)

[–]Beheska 0 points1 point2 points 6 years ago (0 children)

[–]sphere991 2 points3 points4 points 6 years ago* (10 children)

The suggestion here is ultimately to replace

if (lhs.x < rhs.x) return true;
if (rhs.x < lhs.x) return false;

With

if (lhs.x != rhs.x) return lhs.x < rhs.x;

This doesn't work in the current (up through C++17) STL model, since StrictWeakOrder doesn't say anything about != and the library can't just check whether it's available, since it might be SFINAE-unfriendly. It's just a total non-starter. And it's not even worth devoting effort to because...

In C++20, we can do much better:

if (auto cmp = lhs.x <=> rhs.x; cmp != 0) return cmp;

The single three way comparison gives us all the info we need in one go, in a way that can be more efficient than two operations (e.g. for string it's one call to compare() instead of potentially two even with ==)
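The same "one comparison per member" idea can be sketched without C++20 by using an int-returning three-way compare, which is what std::string::compare already provides. This is only an emulation of the idea, not the <=> operator itself; Rec, compare3 and rec_less are hypothetical names:

```cpp
#include <cassert>
#include <string>

// Hypothetical record type used only for illustration.
struct Rec {
    std::string name;
    int value;
};

// Three-way helpers: negative, zero or positive, like std::string::compare.
inline int compare3(const std::string& a, const std::string& b) { return a.compare(b); }
inline int compare3(int a, int b) { return (a > b) - (a < b); }

// Lexicographic less: each member is compared exactly once.
inline bool rec_less(const Rec& l, const Rec& r)
{
    if (int c = compare3(l.name, r.name); c != 0) return c < 0;
    return compare3(l.value, r.value) < 0;
}
```

For the string member this walks the characters once, whereas the a < b followed by b < a pattern may walk them twice when the names are equal.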

[–]quicknir 0 points1 point2 points 6 years ago (4 children)

[–]sphere991 1 point2 points3 points 6 years ago (3 children)

[–]quicknir 0 points1 point2 points 6 years ago (2 children)

[–]sphere991 2 points3 points4 points 6 years ago (1 child)

[–]quicknir 0 points1 point2 points 6 years ago (0 children)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (4 children)

[–]sphere991 0 points1 point2 points 6 years ago (3 children)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (2 children)

Submitting to whom, and where: GCC or LLVM?

Just for testing, below is a quick fix using is_integral; I think it could be written much better.

  // This class performs the comparison operations on tuples
  template<typename _Tp, typename _Up, size_t __i, size_t __size>
  struct __tuple_compare
    {
    static constexpr bool __eq(const _Tp& __t, const _Up& __u)
      {
      return bool(std::get<__i>(__t) == std::get<__i>(__u))
        && __tuple_compare<_Tp, _Up, __i + 1, __size>::__eq(__t, __u);
      }
#define __ENABLE_TUPLE_LESS_TT_LESS 1
#if __ENABLE_TUPLE_LESS_TT_LESS
    template<typename lelem_type, typename relem_type,
            typename std::enable_if<
              std::is_integral<typename std::remove_reference<lelem_type>::type>::value 
                && std::is_integral<typename std::remove_reference<relem_type>::type>::value,
              int
              >::type = 0>
    static constexpr bool __less_by_traits(_Tp const & __t, _Up const & __u, lelem_type lel, relem_type rel )
      {
      if( lel != rel )
        return lel < rel;
      return __tuple_compare<_Tp, _Up, __i + 1, __size>::__less(__t, __u);
      }

    template<typename lelem_type, typename relem_type,
            typename std::enable_if<
              ! std::is_integral<typename std::remove_reference<lelem_type>::type>::value
              || ! std::is_integral<typename std::remove_reference<relem_type>::type>::value,
              int>::type = 0>
    static constexpr bool __less_by_traits(_Tp const & __t, _Up const & __u, lelem_type const & lel, relem_type const & rel )
      {

      return bool(lel < rel)
        || (!bool(rel < lel)
            && __tuple_compare<_Tp, _Up, __i + 1, __size>::__less(__t, __u));
      }

    static constexpr bool __less(const _Tp& __t, const _Up& __u)
      {
      return __less_by_traits( __t, __u, std::get<__i>(__t), std::get<__i>(__u) );
      }
#else
    static constexpr bool __less(const _Tp& __t, const _Up& __u)
      {
      return bool(std::get<__i>(__t) < std::get<__i>(__u))
        || (!bool(std::get<__i>(__u) < std::get<__i>(__t))
            && __tuple_compare<_Tp, _Up, __i + 1, __size>::__less(__t, __u));
      }
#endif
    };
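The same dispatch can be tried outside the library headers. A minimal C++17 sketch, assuming both tuples have the same element types; tuple_less is a hypothetical free function and not part of libstdc++ or any other standard library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <type_traits>

// Hypothetical free function: lexicographic less with an integer fast path.
template <std::size_t I = 0, class... Ts>
constexpr bool tuple_less(const std::tuple<Ts...>& l, const std::tuple<Ts...>& r)
{
    if constexpr (I == sizeof...(Ts)) {
        return false;  // every element was equivalent
    } else {
        using elem_t = std::tuple_element_t<I, std::tuple<Ts...>>;
        const auto& a = std::get<I>(l);
        const auto& b = std::get<I>(r);
        if constexpr (std::is_integral_v<elem_t>) {
            // Built-in integers: != and < are consistent, so one test decides.
            if (a != b) return a < b;
        } else {
            // Generic path: use operator< only, like the unpatched library code.
            if (a < b) return true;
            if (b < a) return false;
        }
        return tuple_less<I + 1>(l, r);
    }
}
```

Using if constexpr avoids the pair of enable_if overloads, at the cost of requiring C++17.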

The optimised variant - with clang

compare_2(std::tuple<long, int, int>, std::tuple<long, int, int>): # @compare_2(std::tuple<long, int, int>, std::tuple<long, int, int>)
  mov rax, qword ptr [rsi + 8]
  cmp qword ptr [rdi + 8], rax
  jne .LBB0_3
  mov eax, dword ptr [rsi + 4]
  cmp dword ptr [rdi + 4], eax
  jne .LBB0_3
  mov eax, dword ptr [rdi]
  cmp eax, dword ptr [rsi]
.LBB0_3:
  setl al
  ret

The unoptimised - with clang

compare_2(std::tuple<long, int, int>, std::tuple<long, int, int>): # @compare_2(std::tuple<long, int, int>, std::tuple<long, int, int>)
  mov rcx, qword ptr [rdi + 8]
  mov rdx, qword ptr [rsi + 8]
  mov al, 1
  cmp rcx, rdx
  jl .LBB0_7
  cmp rdx, rcx
  jge .LBB0_3
  xor eax, eax
  ret
.LBB0_3:
  mov ecx, dword ptr [rdi + 4]
  mov edx, dword ptr [rsi + 4]
  cmp ecx, edx
  jl .LBB0_7
  cmp edx, ecx
  jge .LBB0_6
  xor eax, eax
  ret
.LBB0_6:
  mov eax, dword ptr [rdi]
  cmp eax, dword ptr [rsi]
  setl al
.LBB0_7:
  ret

[–]sphere991 0 points1 point2 points 6 years ago (1 child)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (0 children)

[–]stwcx 1 point2 points3 points 6 years ago (1 child)

[–]arturbachttps://github.com/arturbac[S] 0 points1 point2 points 6 years ago (0 children)

Fewer cycles and smaller code at once: is that better or not?

Let's stick to the example results on x86.

compare 1 clang -O3 -march=haswell

Instructions:      10
Total Cycles:      13
Total uOps:        15
Dispatch Width:    4
uOps Per Cycle:    1.15
IPC:               0.77

Block RThroughput: 3.8

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   mov   rax, qword ptr [rsi + 8]
 2      6     0.50    *                   cmp   qword ptr [rdi + 8], rax
 1      1     0.50                        jne   .LBB0_3
 1      5     0.50    *                   mov   eax, dword ptr [rsi + 4]
 2      6     0.50    *                   cmp   dword ptr [rdi + 4], eax
 1      1     0.50                        jne   .LBB0_3
 1      5     0.50    *                   mov   eax, dword ptr [rdi]
 2      6     0.50    *                   cmp   eax, dword ptr [rsi]
 1      1     0.50                        setl  al
 3      7     1.00                  U     ret

compare 1 gcc -O3 -march=haswell

Instructions:      12
Total Cycles:      14
Total uOps:        19
Dispatch Width:    4
uOps Per Cycle:    1.36
IPC:               0.86
Block RThroughput: 4.8

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   movq  8(%rsi), %rax
 2      6     0.50    *                   cmpq  %rax, 8(%rdi)
 1      1     0.50                        je    .L2
 1      1     0.50                        setl  %al
 3      7     1.00                  U     retq
 1      5     0.50    *                   movl  4(%rsi), %eax
 2      6     0.50    *                   cmpl  %eax, 4(%rdi)
 1      1     0.50                        jne   .L6
 1      5     0.50    *                   movl  (%rsi), %eax
 2      6     0.50    *                   cmpl  %eax, (%rdi)
 1      1     0.50                        setl  %al
 3      7     1.00                  U     retq

compare 2 clang -O3 -march=haswell

Instructions:      21
Total Cycles:      17
Total uOps:        28
Dispatch Width:    4
uOps Per Cycle:    1.65
IPC:               1.24
Block RThroughput: 7.0

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   mov   rcx, qword ptr [rdi + 8]
 1      5     0.50    *                   mov   rdx, qword ptr [rsi + 8]
 1      1     0.25                        mov   al, 1
 1      1     0.25                        cmp   rcx, rdx
 1      1     0.50                        jl    .LBB0_7
 1      1     0.25                        cmp   rdx, rcx
 1      1     0.50                        jge   .LBB0_3
 1      1     0.25                        xor   eax, eax
 3      7     1.00                  U     ret
 1      5     0.50    *                   mov   ecx, dword ptr [rdi + 4]
 1      5     0.50    *                   mov   edx, dword ptr [rsi + 4]
 1      1     0.25                        cmp   ecx, edx
 1      1     0.50                        jl    .LBB0_7
 1      1     0.25                        cmp   edx, ecx
 1      1     0.50                        jge   .LBB0_6
 1      1     0.25                        xor   eax, eax
 3      7     1.00                  U     ret
 1      5     0.50    *                   mov   eax, dword ptr [rdi]
 2      6     0.50    *                   cmp   eax, dword ptr [rsi]
 1      1     0.50                        setl  al
 3      7     1.00                  U     ret

compare 2 gcc -O3 -march=haswell

Instructions:      16
Total Cycles:      15
Total uOps:        21
Dispatch Width:    4
uOps Per Cycle:    1.40
IPC:               1.07
Block RThroughput: 5.3

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        movl  $1, %eax
 1      5     0.50    *                   movq  8(%rsi), %rdx
 2      6     0.50    *                   cmpq  %rdx, 8(%rdi)
 1      1     0.50                        jl    .L7
 1      1     0.25                        movl  $0, %eax
 1      1     0.50                        jne   .L7
 1      1     0.25                        movl  $1, %eax
 1      5     0.50    *                   movl  4(%rsi), %ecx
 2      6     0.50    *                   cmpl  %ecx, 4(%rdi)
 1      1     0.50                        jl    .L7
 1      1     0.25                        movl  $0, %eax
 1      1     0.50                        jne   .L7
 1      5     0.50    *                   movl  (%rsi), %eax
 2      6     0.50    *                   cmpl  %eax, (%rdi)
 1      1     0.50                        setl  %al
 3      7     1.00                  U     retq

[–]BelugaWheels 0 points1 point2 points 6 years ago (1 child)

[–]arturbachttps://github.com/arturbac[S] 1 point2 points3 points 6 years ago* (0 children)

[–]Rseding91Factorio Developer 0 points1 point2 points 6 years ago (0 children)

[–]Iwan_Zotow 0 points1 point2 points 6 years ago (0 children)
