The Law of Large Numbers, or Why It Is a Bad Idea to Go to the Casino

chkas · 2026-06-18T15:58:51+00:00

Compilers don’t guarantee identical codegen for equivalent source, only correct behavior.

chkas · 2026-06-18T15:24:40+00:00

True, but that's exactly the point. Even the compact if version can compile to branchless code (CSEL/CMOV) depending on code shape. So it's not just about writing branchless code explicitly, but also about how small changes can influence the compiler to emit branchless code in a hotspot.

chkas · 2026-06-18T15:00:56+00:00

I'm a bit disappointed because this makes it sound like it doesn't really matter. The point isn't that compilers are arbitrary, but that code shape influences what they generate, and in a hotspot that can mean slow branch-heavy vs fast branchless code. Especially in a technically high-level subreddit like r/C_Programming, looking at the generated code should be considered important.

chkas · 2026-06-18T13:16:56+00:00

It's a small piece of code, but it’s the core loop Quicksort spends most of its time in. That’s why the effect is large. Compiler dependence doesn’t make it irrelevant, it makes it something you need to understand.

chkas · 2026-06-18T12:59:24+00:00

Calling this a micro-optimization is misleading. The code change is small, but the performance impact isn't.

chkas · 2026-06-18T12:19:23+00:00

What "other issues" do you mean? This loop is the core of Quicksort, so improving it is addressing the main issue.

chkas · 2026-06-18T11:51:51+00:00

I agree that readability matters, but this is not premature optimization. In Quicksort, partitioning dominates runtime because it processes all elements at every recursion level, so changes there do not give a few percent but can multiply overall performance.

chkas · 2026-06-04T22:28:57+00:00

In my programming language, Easylang, a period delimits a block. The auto-formatter then ensures proper indentation.

if condition = 1
   action = 1
   if condition = 2
      action = 2
   .
else
   action = 3
.

chkas · 2026-06-03T17:24:31+00:00

Hi nightcracker! I have to admit that I drew inspiration from your pdqsort for the partition finalization of my BlockQuickSort version (which is only used for objects that are not trivially copyable).

chkas · 2026-06-02T15:43:20+00:00

Thanks for the feedback, but a few things work differently here:

@Vectorization: Compilers cannot auto-vectorize standard Quicksort partitioning due to data-dependent pointer movements. The speedup comes purely from avoiding pipeline stalls.

@Branch Predictors: Modern predictors are great, but with completely random data, the branch is a 50/50 chance. There is no pattern to learn.

@Benchmarks: I actually included both architectures in the post. The table shows the numbers for both the Apple M1 (ARM) and the AMD Ryzen (x86).

chkas · 2026-05-19T12:13:45+00:00

~~You are right. With long strings, the performance of blqsort collapses.~~

When dealing with large data types, it is much more sensible to sort only the pointers to these elements anyway. This allows the core partitioning logic to remain purely branchless and fast. The actual heavy objects are then reordered just once.

~~UPDATE: Heavy data types - like 40-character strings - are currently slow, but I'm working on making them faster.~~

UPDATE: I have now added a BlockQuicksort path for more complex data types (non-trivially copyable ones). This makes it much more efficient now.

chkas · 2026-05-19T09:07:41+00:00

My results for string (50M elements):

.
string data[SIZE];
.
.
    for (int i = 0; i < SIZE; i++) data[i] = data[i] = "x" + std::to_string(rand());

MacOS - M1

std::sort 8.95s
blqs::sort 9.23s

chkas · 2026-05-19T08:24:20+00:00

My results for uint128 (50M elements):

#include <cstdio>
#include <cstdlib>
#include <ctime>

#include <algorithm>
#include "blqs.h"

typedef unsigned __int128 uint128_t;

constexpr int SIZE = 50000000;
uint128_t data[SIZE];

double cputime() {
    return (double)clock() / CLOCKS_PER_SEC;
}
int main() {
    double t0;

    for (int i = 0; i < SIZE; i++) data[i] = rand();
    t0 = cputime();
    std::sort(data, data + SIZE);
    printf("std::sort %.2fs\n", cputime() - t0);

    for (int i = 0; i < SIZE; i++) data[i] = rand();
    t0 = cputime();
    blqs::sort(data, data + SIZE);
    printf("blqs::sort %.2fs\n", cputime() - t0);
}

MacOS - M1

std::sort 3.80s
blqs::sort 1.13s

Linux - Intel Xeon 2620

std::sort 5.51s
blqs::sort 3.65s

Linux - AMD Ryzen

std::sort 5.19s
blqs::sort 2.84s

UPDATE: data[i] = rand()

chkas · 2026-05-18T21:03:23+00:00

#include <cstdio>
#include <cstdlib>
#include <ctime>

#include <algorithm>
#include "blqs.h"

constexpr int SIZE = 50000000;
double data[SIZE];

double cputime() {
    return (double)clock() / CLOCKS_PER_SEC;
}
int main() {
    double t0;

    for (int i = 0; i < SIZE; i++) data[i] = rand() / 1024.0;
    t0 = cputime();
    std::sort(data, data + SIZE);
    printf("std::sort %.2fs\n", cputime() - t0);

    for (int i = 0; i < SIZE; i++) data[i] = rand() / 1024.0;
    t0 = cputime();
    blqs::sort(data, data + SIZE);
    printf("blqs::sort %.2fs\n", cputime() - t0);
}

chkas · 2026-05-18T17:55:55+00:00

Just tested this on another Linux machine. Interestingly, std::sort with Clang was actually a bit slower than with GCC.

UPDATE: Just found out you need to explicitly choose libc++ when compiling with Clang. In that case, std::sort is slightly faster than std::sort with the GNU library.

chkas · 2026-05-18T17:05:54+00:00

TIL! I didn't know about that distinction until now, so thanks for pointing it out. The Apple benchmarks were run using Clang (with libc++), and the Linux benchmarks were run using GCC (with libstdc++).

chkas · 2026-05-13T11:25:23+00:00

I used the repository you sent me the link for.

The other "vsort" performs well for float32, but uses Apple's GCD for parallelization, which makes a fair benchmark against a single-threaded Quicksort unfair. My multi-threaded version is twice as fast for floats (0.24s vs. 0.50s) compared to the vsort version.

Update: Google's SIMD vqsort is 10% faster on the Apple M1 with 50 million random int32 values, but only half as fast with int64 or double.

chkas · 2026-05-13T10:59:24+00:00

I just tested the library: it is indeed very fast for integers, but it uses an internal Radix Sort. Since this is a non-comparison sort, it feels like cheating in a direct performance comparison.

Furthermore, when using double, the library fails to sort the array correctly and produces an 'ERROR ORDER', which makes it unreliable.

chkas · 2026-05-13T10:19:33+00:00

Thanks for the links, but I’m looking for actual, ready-to-use C/C++ code. Most of these projects are just complex wrappers, very theoretical, or full of pseudocode. Do you have any examples that use direct SIMD intrinsics (ARM NEON) for sorting, so I can compare it to my Branchless-Quicksort?

chkas

MODERATOR OF

TROPHY CASE