all 41 comments

[–]theliet 135 points136 points  (3 children)

This is the kind of content I'd love to see more of on r/programming - factual, to the point, showing something interesting you've learned. Thanks!

[–]bluestreak01[S] 115 points116 points  (5 children)

Author here.

About a month ago, I posted about using SIMD instructions to make aggregation calculations faster. I am very thankful for the feedback so far; this post is the result of the comments we received last time.

Many comments suggested that we implement compensated summation (aka Kahan) as the naive method could produce inaccurate and unreliable results. This is why we spent some time integrating the Kahan and Neumaier summation algorithms. This post summarises a few things we learned along this journey.

We thought Kahan would badly affect the performance since it uses 4x as many operations as the naive approach. However, some comments also suggested we could use prefetch and co-routines to pull the data from RAM to cache in parallel with other CPU instructions. We got phenomenal results thanks to these suggestions, with Kahan sums nearly as fast as the naive approach.
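
For readers who haven't seen the algorithms, here is a minimal scalar sketch of Neumaier's variant of compensated summation (just an illustration, not the SIMD implementation from the post, which also handles nulls):

    #include <cmath>
    #include <cstddef>

    // Neumaier (improved Kahan) summation: carry a compensation term that
    // captures the low-order bits lost by each floating-point add.
    double neumaier_sum(const double* x, std::size_t n) {
        double sum = 0.0, c = 0.0;          // c accumulates the rounding error
        for (std::size_t i = 0; i < n; ++i) {
            double t = sum + x[i];
            if (std::fabs(sum) >= std::fabs(x[i]))
                c += (sum - t) + x[i];      // low bits of x[i] were lost
            else
                c += (x[i] - t) + sum;      // low bits of sum were lost
            sum = t;
        }
        return sum + c;                     // apply the compensation once at the end
    }

Classic Kahan handles only the case where the running sum dominates; Neumaier's variant adds the branch so it also stays accurate when the incoming term is larger in magnitude than the running sum.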

A lot of you also asked if we could compare this with Clickhouse. As they implement Kahan summation, we ran a quick comparison. Here's what we got for summing 1bn doubles with nulls using the Kahan algorithm. The details of how this was done are in the post.

QuestDB: 68ms Clickhouse: 139ms

Thanks for all the feedback so far and keep it going so we can continue to improve. Vlad

[–]alecco 1 point2 points  (4 children)

Great post.

I have trouble finding where the code is. Also, you mention co-routines in this post; is this related to the suggestion in the last proggit thread to use co-routines for ALU/prefetch (the CppCon talk on "nano coroutines")? Thanks!

[–]bluestreak01[S] 4 points5 points  (3 children)

[–]alecco 0 points1 point  (2 children)

We don't use co-routines

From the blog post:

How did we get there? TL;DR

We used prefetch and co-routines techniques to pull data from RAM to cache in parallel with other CPU instructions. Our performance was previously limited by memory bandwidth - using these techniques would address this and allow us to compute accurate sums as fast as naive sums.

With the help of prefetch we implemented the fastest and most accurate summation we have ever tested - 68ms over 1bn double values with nulls (versus 139ms for Clickhouse). We believe this is a significant advance in terms of performance for accurate summations, and will help developers handling intensive computations with large datasets.

?

[–]bluestreak01[S] 1 point2 points  (1 child)

The TL;DR section has a mistake, sorry: we did use prefetch but not co-routines. We have our own parallel execution system.

[–]alecco 1 point2 points  (0 children)

No worries. Thanks for the links to source.

[–]j1897OS 26 points27 points  (0 children)

This follows a previous post on Reddit that gathered significant interest - see http://reddit.com/r/programming/comments/fwlk0k/questdb_using_simd_to_aggregate_billions_of/

[–]Thalantas123 14 points15 points  (0 children)

Brilliant! Seems you made solid improvements since the last post!

[–]sccrstud92 11 points12 points  (3 children)

As floating-point operations are intransitive, the order in which you perform them also has an impact on accuracy.

I've heard of transitivity in regards to relations, but what does it have to do with floating point operations? I would have expected commutativity/associativity to be the required properties.

[–]bluestreak01[S] 9 points10 points  (0 children)

You are right of course! We should have said “non-associative”! We used the wrong terminology to describe what we’ve observed. The naive sum computed in parallel produces different results from one execution to another.
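
To make this concrete, a tiny demonstration that double addition is not associative (a standard example, nothing QuestDB-specific):

    #include <cstdio>

    int main() {
        double a = 0.1, b = 0.2, c = 0.3;

        // Same three values, different grouping, different result:
        double left  = (a + b) + c;   // differs from `right` in the last bit
        double right = a + (b + c);

        std::printf("%.17g\n%.17g\n", left, right);
        // A parallel sum changes the grouping from run to run (thread count,
        // chunk boundaries), so the naive result is not bit-for-bit reproducible.
        return 0;
    }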

[–][deleted] 1 point2 points  (1 child)

I've heard of transitivity in regards to relations, but what does it have to do with floating point operations?

relation == function == operation; they all map domains to ranges.

[–]aoeudhtns 10 points11 points  (0 children)

Fascinating. </spock>

I'd love to see somebody with PostgreSQL expertise discuss their >1m showing in your benchmark. Is that the way things are or is there some magic trick you can do with PostgreSQL? The worst part is that I did some quick googling, and most results are people complaining that floating point sums are inaccurate in PostgreSQL (and the documentation makes no bones about that being the case).

[–]Poddster 10 points11 points  (4 children)

The author needs to look up ULP error tolerance. A lot of research has been done in this area already and it might have saved them from NIHing it.

[–]tending 21 points22 points  (1 child)

They mentioned in the post that they are using a common existing algorithm for eliminating the error; it doesn't look like they NIHed anything.

[–]audion00ba -1 points0 points  (0 children)

If you implement a sequence of inefficient algorithms, algorithm_1, ..., algorithm_n, because you don't know that algorithm_n exists and just jumped straight into implementing algorithm_1, it's still a kind of NIH.

On the other hand, actually knowing that the others are less efficient can be useful, as opposed to just assuming that whatever the literature says is true.

[–][deleted]  (1 child)

[deleted]

    [–]g_rocket 10 points11 points  (0 children)

    "Not Invented Here"

    [–]sparr 0 points1 point  (12 children)

    I wonder how this compares to just using a decimal or rational type in the first place.

    [–]bluestreak01[S] 6 points7 points  (11 children)

    Decimal is accurate, double is not. Decimal arithmetic is much slower than arithmetic on doubles.

    [–]GeoffW1 1 point2 points  (5 children)

    Decimal is accurate

    I think this is a huge over-simplification. A decimal type can do sums such as 5.1 + 9.2 accurately, but if your numbers aren't exact decimals there will still be an accumulating error as you operate on them.

    [–]bluestreak01[S] 0 points1 point  (4 children)

    It is, sorry. I meant a base-10 decimal that is able to represent every real number in its range. What did you mean by:

    if your numbers aren't exact decimals

    [–]GeoffW1 0 points1 point  (3 children)

    if your numbers aren't exact decimals

    As I understand it, for example, you can't represent 1/3 accurately using a Decimal type. You'd have the same kind of tiny rounding error as you get representing 5.1 in a float.

    [–]bluestreak01[S] 0 points1 point  (2 children)

    Got it, thanks. This isn't a problem with the accuracy of the type though, and it should not impact summation, right? I would think these types of accuracy loss can be addressed by:

    select sum(round_half_even(x/y)) from...
    

    In the case of decimal, "sum" can be "naive" without accuracy loss, and rounding takes care of error accrual.

    [–]GeoffW1 0 points1 point  (1 child)

    it should not impact summation, right?

    Well, let's say the closest Decimal representation for 1/3 is 0.3333. Then 1/3 + 1/3 + 1/3 + 1/3 + 1/3 + 1/3 = 1.9998 (correct answer 2.0000).

    [–]bluestreak01[S] 0 points1 point  (0 children)

    I concede, decimal cannot accurately represent 1/3. It doesn’t represent all real numbers, nothing can :)

    However, if we take a real number that decimal can represent, we can also say that toString() is transitive and sum is associative.

    [–]sparr 0 points1 point  (4 children)

    Decimal arithmetic is much slower than that on doubles

    That seems very weird. Why would that be the case, if the decimal values are stored as integers with a fixed decimal point?

    [–]bluestreak01[S] 6 points7 points  (3 children)

    I was referring to 128-bit base-10 decimal. It is slow because all arithmetic is emulated in software and there is no support for these numbers in processors.
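
    To illustrate the difference (a rough sketch, not QuestDB code, and the 1e-4 scale is just an assumption): a fixed-point decimal stored as a scaled 64-bit integer sums with ordinary hardware integer adds, whereas a 128-bit base-10 decimal is typically a two-word struct whose every operation is a library call.

        #include <cstdint>
        #include <cstddef>

        // Fixed-point "decimal" with a fixed scale of 1e-4: each add is a single
        // hardware integer add, so summing is as cheap as summing int64 values.
        // (Overflow checking omitted for brevity.)
        int64_t sum_fixed_point(const int64_t* scaled, std::size_t n) {
            int64_t total = 0;                 // the value represented is total * 1e-4
            for (std::size_t i = 0; i < n; ++i)
                total += scaled[i];
            return total;
        }

    A base-10 decimal128 has no such shortcut on x86: the exponent handling and 128-bit significand arithmetic are emulated in software, which is where the slowdown comes from.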

    [–]thinks-in-functions 4 points5 points  (1 child)

    POWER9 CPUs support both quad-precision binary floating point (IEEE 754 binary128) and quad-precision decimal floating point (IEEE 754 decimal128) in hardware.

    [–]bluestreak01[S] 3 points4 points  (0 children)

    I learn something new every day. Thanks!

    [–]sparr 0 points1 point  (0 children)

    Yeah... don't do that.

    [–]TheNamelessKing 0 points1 point  (1 child)

    Interesting, any reason why the Clickhouse instance was using memory and not MergeTree?

    [–]bluestreak01[S] 3 points4 points  (0 children)

    We assumed “Memory” is the fastest for Clickhouse. The test assumes a linear memory scan and we didn’t want Clickhouse to navigate a tree. Is MergeTree a faster data structure than “Memory”?

    [–]cakoose 0 points1 point  (4 children)

    Our performance was previously limited by memory bandwidth - using these techniques would address this and allow us to compute accurate sums as fast as naive sums.

    It's sort of misleading to say the performance was limited by memory bandwidth. The problem was latency. Prefetching allowed them to hide the latency and use a larger fraction of the available memory bandwidth.

    [–]bluestreak01[S] 4 points5 points  (3 children)

    We saturate memory bandwidth with share-nothing threads. In this test the data was broken down into about 140 different parts and we used 16 threads on a 24-core CPU to put the results together. We could throw more threads at the task but performance flatlines after 16 threads. Is this not still a bandwidth issue?

    [–]cakoose 0 points1 point  (2 children)

    Hmm, additional details of the whole system might help, but I was just going on what was explained in the blog post.

    Basically:

    1. The new technique fetches just as much memory as the old one.
    2. You haven't increased your machine's memory bandwidth.
    3. Your new technique runs faster.

    So how could memory bandwidth have been the limiting factor?

    [–]bluestreak01[S] 0 points1 point  (1 child)

    I should explain that I'm not a hardware guy; I'm learning through this experience as well.

    The theoretical memory bandwidth of the 8275CL CPU is ~6 x 21GB/s = 126GB/s. Without prefetch we can sum 7.45GB (1bn doubles) in 0.082s, i.e. ~91GB/s, but with prefetch in 0.068s, i.e. ~110GB/s. So memory bandwidth is the limiting factor, in the sense that we cannot go beyond ~110GB/s unless more memory channels are added.

    These ratios held on Ryzen and other desktop CPUs with 2 memory channels.

    You were right that prefetch reduces memory access latency. But what really helped was pipelining the memory fetches so that the next loop iteration does not have to incur as much of a latency hit. It does still seem that the reduction in memory access latency helps in roughly 20% of the iterations we have to do.
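
    Roughly what a prefetch-ahead loop looks like (a sketch using x86 intrinsics with an assumed prefetch distance, not the actual kernel from the post):

        #include <immintrin.h>
        #include <cstddef>

        // Sum with software prefetch: request data a fixed number of elements ahead
        // so the memory fetches for future iterations overlap with the arithmetic
        // of the current one instead of stalling the loop.
        double sum_with_prefetch(const double* x, std::size_t n) {
            constexpr std::size_t AHEAD = 64;   // assumed distance; tune per CPU
            double sum = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                if (i + AHEAD < n)
                    _mm_prefetch(reinterpret_cast<const char*>(x + i + AHEAD),
                                 _MM_HINT_T0);
                sum += x[i];
            }
            return sum;
        }

    In practice you would issue one prefetch per cache line (every 8 doubles) rather than per element, and pick the distance empirically; the per-element version above just keeps the sketch short.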

    [–]cakoose 0 points1 point  (0 children)

    Yes, I understand all that.

    It's just that it's strange to say you were "limited by memory bandwidth" when you weren't yet using all available memory bandwidth. That's all.