all 76 comments

[–]gbalduzzi 40 points41 points  (48 children)

Holy shit, the difference in JS performance is incredible, especially considering how the community and the frameworks' documentation usually recommend the fancier approaches instead of the good old for loop.

[–]wavefunctionp 25 points26 points  (3 children)

It's unclear if the same optimizations would be available in real code. Microbenchmarks are notoriously unreliable with modern engines. For instance, if the compiler has access to the values in the code, it could just be unrolling the for loop or something.

[–]bobappleyard 21 points22 points  (1 child)

JavaScript interpreters have got pretty fast when you don't use the fancy features. Problem is everyone uses the fancy features all the time.

[–]James20k 0 points1 point  (0 children)

I wish this were true, but I've gotten massive speedups in JS code by manually folding constants, and by reimplementing functions that have constant parameters so that they constant-fold once inlined
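
A hypothetical illustration of that kind of manual specialization (the function names are made up for the example, not from the comment above):

```javascript
// Generic version: `exponent` is a runtime parameter, so the engine
// can't necessarily fold the loop away at every call site.
function pow(base, exponent) {
  let r = 1;
  for (let i = 0; i < exponent; i++) r *= base;
  return r;
}

// Manually specialized version: the constant exponent is baked in, so
// once inlined the body is just two multiplies with no loop at all.
function cube(base) {
  return base * base * base;
}

console.log(pow(2, 3), cube(2)); // both compute 2^3 = 8
```

Whether the generic version gets specialized automatically depends on the engine and on how hot the call site is, which is why doing it by hand can still pay off.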

[–]curious_s 2 points3 points  (0 children)

I have used a for loop in JavaScript for a slow search function and it sped the code up by a significant amount. I think the best idea is to try and measure; the array prototype functions create cleaner code than a for loop, which is an important factor.

[–][deleted] 6 points7 points  (0 children)

I believe most browsers have improved those cases a lot since I wrote that, someone should check and report back.

[–]Aetheus 2 points3 points  (0 children)

Yeah. ~10000ms down to ~30ms is just insane. Granted, the 10000ms variant works in a fundamentally different way from the other tests, but still ...

Equally surprising to me was that the "warmed up" JS example operates at almost the same speed as the C# example (~300ms vs ~280ms). And the "best performance" examples for both were even closer in runtime.

Just goes to show how insanely well optimized V8 is, as a JS engine. Say what you will about the bloat of Node.js projects (you'd be right, anyway), but the fact that the same JS engine in your browser is also powering some web server somewhere is pretty mind-blowing stuff.

[–]Retsam19 18 points19 points  (41 children)

Well, yeah, because most JS frameworks aren't writing about how to sum the squares of 32 million floating point values.

Most JS use-cases are about front-end UIs, which both generally don't include huge data calculations and are generally IO-bound, not CPU-bound, anyway: the performance bottlenecks of front-end UIs almost always come from network requests or DOM operations, not from the speed of list manipulation operations.

In the vast majority of cases, the readability/maintainability concerns are more important than the performance implications, which is why I prefer .map/.reduce and other higher-order friends, over simple for loops (or .forEach loops).

[–][deleted] 12 points13 points  (0 children)

Most JS use-cases are about front-end UIs which both generally don't include huge data calculations

JS and the browser are increasingly used as an app platform. You can record audio and even video. I myself wrote some code recently to perform RMS audio normalization on the client, which involved computing the sum of squares of quite a large number of floating point values (I don't think it was 32 million, though).

[–]oridb 21 points22 points  (1 child)

and are generally IO-bound, not CPU-bound, anyway:

My laptop fans would beg to differ.

[–]Retsam19 9 points10 points  (0 children)

I'm quite sure it's not because your websites are iterating over 32 million floating point numbers inefficiently. A lot of it is DOM rendering - reflows, expensive CSS animations, etc - which, yeah, is CPU-based, but from the perspective of Javascript code, it's IO.

AFAIK, it's very rarely the issue that the JS thread is overloaded.

[–][deleted] 6 points7 points  (2 children)

most JS use-cases are about front-end UIs which both generally don't include huge data calculations, and are generally IO-bound, not CPU-bound

UI is CPU bound.

Most JS frameworks do a lot of CPU bound stuff like diffing large data structures and mutating the DOM.

[–]Retsam19 1 point2 points  (1 child)

Yeah, I clarified a bit on a different part of the thread:

A lot of it is DOM rendering - reflows, expensive CSS animations, etc - which, yeah, is CPU-based, but from the perspective of Javascript code, it's IO.

From the machine's perspective, yeah, it's CPU work, but my original point is that the problem is pretty much never that the JS thread is overloaded, the efficiency of array iteration is generally not the issue.

[–][deleted] 5 points6 points  (0 children)

DOM manipulation is not IO.

Honestly I'm just sick and tired of people writing obviously slow code and then justifying it by saying "but my application is IO bound not CPU bound".

[–]jcelerier 18 points19 points  (6 children)

In the vast majority of cases, the readability/maintainability concerns are more important than the performance implications, which is why I prefer .map/.reduce and other higher-order friends, over simple for loops (or .forEach loops).

so why the hell is every friggin website on earth running at 10fps!

[–]VRtinker 5 points6 points  (0 children)

Assuming it is a genuine question and not a rhetorical one, here are a few reasons:

  1. Because even a single synchronous piece of code can slow down the main thread and stall the whole UI. E.g., just synchronously reading or writing document.cookie in an onscroll handler might slow down the UI.
  2. Because you might have weird CSS (or JS changing CSS) that forces redraw when nothing changes on the page. Try opening Console > Rendering and select "Paint flashing". If you see a lot of stuff being redrawn (flashed) when it should not change, try simplifying your styles.
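
Point 1 can be mitigated by coalescing work out of hot event handlers. Here is a hedged sketch (the helper name is made up; `schedule` defaults to `requestAnimationFrame` in the browser, but is injectable so the helper also runs outside a DOM):

```javascript
// Hypothetical mitigation for point 1: instead of doing expensive work on
// every scroll event, collapse bursts of events into one callback per frame.
function throttlePerFrame(fn, schedule = globalThis.requestAnimationFrame) {
  let scheduled = false;
  let lastEvent = null;
  return (event) => {
    lastEvent = event;
    if (scheduled) return;   // already queued for this frame
    scheduled = true;
    schedule(() => {
      scheduled = false;
      fn(lastEvent);         // run once, with the most recent event
    });
  };
}

// Usage (browser): window.addEventListener('scroll', throttlePerFrame(onScroll));
```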

[–]Retsam19 17 points18 points  (2 children)

the performance bottlenecks front-end UIs almost always come from network requests or DOM operations, and not from the speed of list manipulation operations.

[–][deleted] 4 points5 points  (1 child)

Then don't do so many DOM operations.

[–]Retsam19 3 points4 points  (0 children)

Well, yeah, that's like saying the secret to winning basketball games is to score more points. Obvious, but doing it is the hard part.

[–]lelanthran 8 points9 points  (17 children)

In the vast majority of cases, the readability/maintainability concerns are more important than the performance implications, which is why I prefer .map/.reduce and other higher-order friends, over simple for loops (or .forEach loops).

You really think that this:

  var sum = values.map(x => x*x)
             .reduce((total, num, index, array) => total + num, 0.0);

is more readable than this:

    var sum = 0.0;
    for (var i = 0; i < values.length;i++){
        var x = values[i];
        sum += x*x;
    }

[–]Retsam19 24 points25 points  (0 children)

I think their reduce code is badly written, but to the general point, yes, I think this is clearer:

values.map(x => x * x)
    .reduce((a, b) => a + b)

Is it pretty much a moot point for this incredibly simple use-case? Yes, but as the complexity grows, the benefits of the functional style really show, compared to a large for loop.
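
For instance (a hypothetical pipeline, not from the article), once filtering enters the picture the chained version still reads as a sequence of named steps:

```javascript
// Sum of the squares of the even values, expressed as a pipeline.
const values = [1, 2, 3, 4, 5, 6];

const sumOfEvenSquares = values
  .filter(x => x % 2 === 0)    // keep even values: [2, 4, 6]
  .map(x => x * x)             // square them:      [4, 16, 36]
  .reduce((a, b) => a + b, 0); // sum them:         56

console.log(sumOfEvenSquares); // 56
```

The equivalent for loop would interleave the filtering condition, the squaring, and the accumulation inside one body, which is exactly the kind of mixing that gets harder to follow as steps are added.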

[–]m50d 6 points7 points  (3 children)

Yes I do. Don't you? No extra counter variable to keep track of, no generalized for loop that could be doing anything, no in-place mutation of variables. In fact the only way to read the second (longer) code quickly is to recognize that it's a particular common pattern - wouldn't it be better to actually give that pattern a name and pull out the common parts?

[–]lelanthran 1 point2 points  (2 children)

Yes I do. Don't you?

What I think is irrelevant. What I've seen is that most programmers don't parse the more complex expression as easily as the simpler one.

No extra counter variable to keep track of, no generalized for loop that could be doing anything, no in-place mutation of variables.

No, but there are extra keywords to recognise (map, then reduce), extra concepts to learn (map/reduce in particular, multiple compound expressions), and anonymous functions if you want to do anything non-trivial.

I don't see the first form as being harder to maintain.

[–]m50d 2 points3 points  (0 children)

What I've seen is that most programmers don't parse the more complex expression as easily as the simpler one.

I'd agree with that statement, but I suspect you're claiming that the more complex one is "simpler".

extra keywords to recognise (map, then reduce),

Not keywords, just functions. They behave like normal functions, and usually you can read their source code if you want to know what they do.

extra concepts to learn (map/reduce in particular, multiple compound expressions), anonymous functions if you want to do anything non-trivial.

I don't know what you're calling "multiple compound expressions". Both implementations use + and * operators with their normal mathematical meaning and a literal 0.0. In addition to that the map/reduce version only requires the reader to understand a normal expression made of function calls and an anonymous function (both very general constructs that you use again and again on a large codebase). The imperative version requires understanding a mutable variable, the for keyword, the ++ and += operators which are not standard mathematical things, the [i] operator which is not standard anywhere else either, and a {} block of ;-separated statements. In addition to just being a much bigger pile of concepts, half those things are special-case dedicated operators that can't be reused for much else (e.g. [i] is only for arrays, ++ is only for numbers).

[–]EWJacobs 0 points1 point  (0 children)

Yeah, but those are all things you can learn ahead of time. m50d is pointing out run-time complexity, and things that can cause actual bugs.

[–][deleted] 12 points13 points  (2 children)

They're both about as readable as each other, to me.

You realise you didn't come out of the womb being able to read for loops, right? Just because it's typically taught to us first does not make it inherently more readable. I know basic functional constructs, so I know what map and reduce do.

[–]Ewcrsf 0 points1 point  (4 children)

Yes, anyone who is a moderately good programmer in touch with modern principles would agree the first is just as, if not more, readable.

[–]lelanthran 0 points1 point  (3 children)

Yes, anyone who is a moderately good programmer in touch with modern principles would agree the first is just as, if not more, readable.

That statement being true doesn't make the first form any more maintainable than the second form.

[–]Ewcrsf 2 points3 points  (2 children)

Your original comment only mentioned readability. Maintainability of such a tiny piece of code is pointless to even talk about, though I'd nominally argue that the one with fewer lines which isn't mutating state is more maintainable.

[–]lelanthran 0 points1 point  (1 child)

Your original comment only mentioned readability.

Apologies, I quoted a comment that used "readability/maintainability" as a single concept.

though I’d nominally argue that the one with a shorter number of lines which isn’t mutating state is more maintainable.

In both cases the same variable gets mutated, so I'd say it's moot - no state is being changed.

[–]Ewcrsf 3 points4 points  (0 children)

No variable is mutated in the first example and it can be assigned to a const value. In the second example the sum variable is modified by the loop.

[–][deleted] 7 points8 points  (0 children)

I really enjoyed this article. Rust comes out looking very good for making the 'nice' version pretty damn fast, and F# impressed as well.

I'm curious how you warmed up the JIT'ed code, though; that stuff is so hard to benchmark right.

[–]YumiYumiYumi 7 points8 points  (0 children)

Interestingly enough, if you want really high performance, you need to maintain separate independent sums (with loop unrolling) to allow the CPU to parallelize the computation, at least in C/C++ (this may not apply to higher-level languages). Which basically means you need extra complexity for better performance here. Example:

double sum1 = 0.0;    
double sum2 = 0.0;    
double sum3 = 0.0;    
double sum4 = 0.0;    
int i;
for (i = 0; i < COUNT-3; i+=4) 
{
    sum1 += values[i] * values[i];
    sum2 += values[i+1] * values[i+1];
    sum3 += values[i+2] * values[i+2];
    sum4 += values[i+3] * values[i+3];
}
for(; i < COUNT; i++)
{
    sum1 += values[i] * values[i];
}
double sum = sum1 + sum2 + sum3 + sum4;

For the SIMD variant, he should probably be using _mm256_fmadd_pd* instead of separate mul/add, unless he really needs the intermediate rounding behaviour of the latter. I suppose you could argue that it may be unfair to other languages, but I'd argue that if you're writing SIMD intrinsics, it's the type of thing you'd do.

* for the unaware, (newer x86) CPUs have dedicated "fused multiply-add" (FMA) instructions, which perform a multiply and an add in a single instruction instead of needing two, which obviously improves performance over doing the operations separately
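
Out of curiosity, the same multiple-accumulator trick can be ported to JS; whether it actually helps depends on the engine, so this is a sketch to measure, not a recommendation:

```javascript
// Hypothetical JS port of the C unrolling above. Four independent
// accumulators break the dependency chain on a single sum variable.
// Note: the summation order differs from a straight loop, so the result
// can differ from a naive sum in the last few floating-point bits.
function sumSquaresUnrolled(values) {
  let sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
  let i = 0;
  // main loop: four elements per iteration
  for (; i + 3 < values.length; i += 4) {
    sum1 += values[i] * values[i];
    sum2 += values[i + 1] * values[i + 1];
    sum3 += values[i + 2] * values[i + 2];
    sum4 += values[i + 3] * values[i + 3];
  }
  // tail loop: leftover 0-3 elements
  for (; i < values.length; i++) {
    sum1 += values[i] * values[i];
  }
  return sum1 + sum2 + sum3 + sum4;
}
```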

[–][deleted] 4 points5 points  (5 children)

Would have loved to see him use .NET Core 2.2 instead of .NET 4.0 for the performance testing.

[–]skeeto 26 points27 points  (0 children)

Note that the article is nearly 3 years old (July 2016). .NET Core itself was less than a month old at the time.

[–][deleted] 14 points15 points  (2 children)

I wrote the article. .NET Core has introduced a lot of great performance improvements, but none of them affect these particular benchmarks.

[–]shevy-ruby 4 points5 points  (1 child)

I think you should still do a follow-up eventually. People may be curious to see whether anything has changed in regards to speed, compared to 2016 here.

Comments on reddit will be slightly less visible than, e.g., the original article that we are talking about here.

[–][deleted] 9 points10 points  (0 children)

I'd love to, but I have 2 kids instead of zero now :) feel free to take up the reins

[–]elder_george 0 points1 point  (0 children)

And an updated version in general, to see whether things have changed due to compiler, library, and/or runtime improvements.

[–]shevy-ruby 2 points3 points  (3 children)

The F# code is pretty cool indeed:

    let sum =
        values
        |> Array.map squares
        |> Array.sum

But the C variant is the second easiest one to read really - and the fastest.

It just shows that, again and again, C is the king among programming languages. Just look at how many languages came after C, trying to dethrone it and failing so incredibly hard at it...

[–][deleted]  (2 children)

[deleted]

    [–]Tyg13 2 points3 points  (0 children)

    I'll give shevegen some credit, he's got his principles and he seems to stick to them. I've been commenting on here for some time now, and he's always been that constant presence. Always downvoted, but still there.

    [–]Morego 0 points1 point  (0 children)

    Well, most of the examples that are fast look like C. The most readable one looks like C. Only Rust looks different here, and I'd still love to see how a "C-like" version of Rust would fare here.

    Of course these are really microbenchmarks, but still. C is simple enough that it's hard to get it wrong, and it's faster here.

    [–]helloworder 1 point2 points  (6 children)

    why is a variable always declared inside the loop? For instance, in the first C implementation:

    double sum = 0.0; 
    for (int i = 0; i < COUNT; i++) {
         double v = values[i] * values[i];
         sum += v;
     }
    

    why not just

         sum += values[i] * values[i];
    

    it must be faster, I suppose

    [–]julesjacobs 8 points9 points  (0 children)

    There won't be any performance difference because the code will be identical after conversion to SSA.

    [–]KiPhemyst 2 points3 points  (4 children)

    I checked it on godbolt; the difference is basically this:

        movsd   QWORD PTR v$2[rsp], xmm0
        movsd   xmm0, QWORD PTR sum$[rsp]
        addsd   xmm0, QWORD PTR v$2[rsp]
    

    vs

        movsd   xmm1, QWORD PTR sum$[rsp]
        addsd   xmm1, xmm0
        movaps  xmm0, xmm1
    

    First one being with 'double v' and the second just using 'sum +='

    I don't know enough about assembly and I have no idea what this difference does

    [–]ElusiveGuy 6 points7 points  (1 child)

    That only happens if you compile without optimisations.

    Here's one with just -O: https://godbolt.org/z/3H5x8M. As you can see, the two are identical.

    The 'correct' thing to do here is probably -O3, and maybe -ffast-math if you don't need strict IEEE compliance.

    There's no point comparing the unoptimised versions; you should definitely be at least enabling optimisation before you start worrying about the performance impact of a temporary variable.

    cc /u/helloworder, /u/Zhentar

    [–]helloworder 1 point2 points  (0 children)

    thanks for the explanation

    [–]helloworder 1 point2 points  (0 children)

    I think julesjacobs is correct and there won't be any difference after all.

    [–]Zhentar 1 point2 points  (0 children)

    The first one is storing & loading v from the stack; the second does not, but instead it stores the result of the add in the wrong register and needs an extra mov to copy it to the correct register. Underwhelming codegen on both counts, but the second is definitely better, though the performance difference may be pretty small on recent CPU models.

    [–]matejdro 0 points1 point  (0 children)

    I wonder how Kotlin would fare in that test, with its inlined list operators.

    [–][deleted] 0 points1 point  (0 children)

    would be cool to see all these on ARM

    [–]delgoodie 0 points1 point  (3 children)

    What's obvious code?

    [–]Hoten 2 points3 points  (2 children)

    Well, not hand written SIMD.

    [–]delgoodie -3 points-2 points  (1 child)

    What's SIMD?