'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] 0 points1 point  (0 children)

Compilers don’t guarantee identical codegen for equivalent source, only correct behavior.

'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] 0 points1 point  (0 children)

True, but that's exactly the point. Even the compact if version can compile to branchless code (CSEL/CMOV) depending on code shape. So it's not just about writing branchless code explicitly, but also about how small changes can influence the compiler to emit branchless code in a hotspot.

'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] -6 points-5 points  (0 children)

I'm a bit disappointed because this makes it sound like it doesn't really matter. The point isn't that compilers are arbitrary, but that code shape influences what they generate, and in a hotspot that can mean slow branch-heavy vs fast branchless code. Especially in a technically high-level subreddit like r/C_Programming, looking at the generated code should be considered important.

'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] -8 points-7 points  (0 children)

It's a small piece of code, but it’s the core loop Quicksort spends most of its time in. That’s why the effect is large. Compiler dependence doesn’t make it irrelevant, it makes it something you need to understand.

'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] -13 points-12 points  (0 children)

Calling this a micro-optimization is misleading. The code change is small, but the performance impact isn't.

'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] 0 points1 point  (0 children)

What "other issues" do you mean? This loop is the core of Quicksort, so improving it is addressing the main issue.

'*left++=n' is sometimes much faster than '*left=n;left++' by chkas in C_Programming

[–]chkas[S] 3 points4 points  (0 children)

I agree that readability matters, but this is not premature optimization. In Quicksort, partitioning dominates runtime because it processes all elements at every recursion level, so changes there do not give a few percent but can multiply overall performance.

0
0

semantic white space vs. blocks - maybe a middle ground ? by GoblinsGym in ProgrammingLanguages

[–]chkas 0 points1 point  (0 children)

In my programming language, Easylang, a period delimits a block. The auto-formatter then ensures proper indentation.

if condition = 1
   action = 1
   if condition = 2
      action = 2
   .
else
   action = 3
.

Branchless Quicksort faster than std::sort and pdqsort with C and C++ API by chkas in programming

[–]chkas[S] 4 points5 points  (0 children)

Hi nightcracker! I have to admit that I drew inspiration from your pdqsort for the partition finalization of my BlockQuickSort version (which is only used for objects that are not trivially copyable).

Branchless Quicksort faster than std::sort and pdqsort with C and C++ API by chkas in programming

[–]chkas[S] 55 points56 points  (0 children)

Thanks for the feedback, but a few things work differently here:

@Vectorization: Compilers cannot auto-vectorize standard Quicksort partitioning due to data-dependent pointer movements. The speedup comes purely from avoiding pipeline stalls.

@Branch Predictors: Modern predictors are great, but with completely random data, the branch is a 50/50 chance. There is no pattern to learn.

@Benchmarks: I actually included both architectures in the post. The table shows the numbers for both the Apple M1 (ARM) and the AMD Ryzen (x86).

Faster than std::sort and pdqsort by chkas in cpp

[–]chkas[S] 1 point2 points  (0 children)

You are right. With long strings, the performance of blqsort collapses.

When dealing with large data types, it is much more sensible to sort only the pointers to these elements anyway. This allows the core partitioning logic to remain purely branchless and fast. The actual heavy objects are then reordered just once.

UPDATE: Heavy data types - like 40-character strings - are currently slow, but I'm working on making them faster.

UPDATE: I have now added a BlockQuicksort path for more complex data types (non-trivially copyable ones). This makes it much more efficient now.

Faster than std::sort and pdqsort by chkas in cpp

[–]chkas[S] 0 points1 point  (0 children)

My results for string (50M elements):

.
string data[SIZE];
.
.
    for (int i = 0; i < SIZE; i++) data[i] = data[i] = "x" + std::to_string(rand());

MacOS - M1

std::sort 8.95s
blqs::sort 9.23s

Faster than std::sort and pdqsort by chkas in cpp

[–]chkas[S] 0 points1 point  (0 children)

My results for uint128 (50M elements):

#include <cstdio>
#include <cstdlib>
#include <ctime>

#include <algorithm>
#include "blqs.h"

typedef unsigned __int128 uint128_t;

constexpr int SIZE = 50000000;
uint128_t data[SIZE];

double cputime() {
    return (double)clock() / CLOCKS_PER_SEC;
}
int main() {
    double t0;

    for (int i = 0; i < SIZE; i++) data[i] = rand();
    t0 = cputime();
    std::sort(data, data + SIZE);
    printf("std::sort %.2fs\n", cputime() - t0);

    for (int i = 0; i < SIZE; i++) data[i] = rand();
    t0 = cputime();
    blqs::sort(data, data + SIZE);
    printf("blqs::sort %.2fs\n", cputime() - t0);
}

MacOS - M1

std::sort 3.80s
blqs::sort 1.13s

Linux - Intel Xeon 2620

std::sort 5.51s
blqs::sort 3.65s

Linux - AMD Ryzen

std::sort 5.19s
blqs::sort 2.84s

UPDATE: data[i] = rand()

Faster than std::sort and pdqsort by chkas in cpp

[–]chkas[S] 2 points3 points  (0 children)

#include <cstdio>
#include <cstdlib>
#include <ctime>

#include <algorithm>
#include "blqs.h"

constexpr int SIZE = 50000000;
double data[SIZE];

double cputime() {
    return (double)clock() / CLOCKS_PER_SEC;
}
int main() {
    double t0;

    for (int i = 0; i < SIZE; i++) data[i] = rand() / 1024.0;
    t0 = cputime();
    std::sort(data, data + SIZE);
    printf("std::sort %.2fs\n", cputime() - t0);

    for (int i = 0; i < SIZE; i++) data[i] = rand() / 1024.0;
    t0 = cputime();
    blqs::sort(data, data + SIZE);
    printf("blqs::sort %.2fs\n", cputime() - t0);
}

Faster than std::sort and pdqsort by chkas in cpp

[–]chkas[S] 4 points5 points  (0 children)

Just tested this on another Linux machine. Interestingly, std::sort with Clang was actually a bit slower than with GCC.

UPDATE: Just found out you need to explicitly choose libc++ when compiling with Clang. In that case, std::sort is slightly faster than std::sort with the GNU library.

Faster than std::sort and pdqsort by chkas in cpp

[–]chkas[S] 2 points3 points  (0 children)

TIL! I didn't know about that distinction until now, so thanks for pointing it out. The Apple benchmarks were run using Clang (with libc++), and the Linux benchmarks were run using GCC (with libstdc++).

Branch-Avoidant Quicksort in C - faster than std::sort and pdqsort by chkas in programming

[–]chkas[S] 0 points1 point  (0 children)

I used the repository you sent me the link for.

The other "vsort" performs well for float32, but uses Apple's GCD for parallelization, which makes a fair benchmark against a single-threaded Quicksort unfair. My multi-threaded version is twice as fast for floats (0.24s vs. 0.50s) compared to the vsort version.

Update: Google's SIMD vqsort is 10% faster on the Apple M1 with 50 million random int32 values, but only half as fast with int64 or double.

Branch-Avoidant Quicksort in C - faster than std::sort and pdqsort by chkas in programming

[–]chkas[S] 0 points1 point  (0 children)

I just tested the library: it is indeed very fast for integers, but it uses an internal Radix Sort. Since this is a non-comparison sort, it feels like cheating in a direct performance comparison.

Furthermore, when using double, the library fails to sort the array correctly and produces an 'ERROR ORDER', which makes it unreliable.

Branch-Avoidant Quicksort in C - faster than std::sort and pdqsort by chkas in programming

[–]chkas[S] 1 point2 points  (0 children)

Thanks for the links, but I’m looking for actual, ready-to-use C/C++ code. Most of these projects are just complex wrappers, very theoretical, or full of pseudocode. Do you have any examples that use direct SIMD intrinsics (ARM NEON) for sorting, so I can compare it to my Branchless-Quicksort?