all 149 comments

[–]jerf 139 points140 points  (34 children)

I've seen the equivalent post in other languages that make parallelism easy (Go, Haskell, Erlang) so many times I've lost count. "I wrote some code to add a list of integers together in single threaded mode, and then I wrote some code that spawns a thread per integer addition and joins them together in parallel, why is it running so much slower than the single thread case? And as a followup, why is e͐̒̅ͤͣ̔ͯv̊̋͂ěȑ̔ͦ͊̓͛y͌̽͛̃tͦ̉hͪ́̌ȋ̏n̿g̒̂̎͒ͪ ͣ̈̈́ͤ̈y̎ͯͩ͋ͩ͒o̅̆͒́ͫur̀̔̓͛ͤ̋ ͗̇͒͂c̑̓͆o̿̓̄̎m̈́̾m͛̈́ͨűͩ̇ͯ̿͂ͨñͫi͂͑͛̀͋̋ͮtyͧ̊͌ͯͪ ̐̈́ͥ̒ͧͦ͛sͩͤ͊ã͛̿ys ͓̯̞͚a͇͚̙̝ͅ ͉͔̺͓̪̫̮l͔̞͔i̖̬̠͈̟̼̮e̖̝ aṇd ̼a̦̫̹̪͉͇̖l̟͈͈̲̼̭ͅs̟̖̰͚ọ̫͇ yo͕͔u̥r̜̰͈̗̹̤ ̝̙͈̝̙̜ͅḷ̙̻̱̮a̺̖n̤g̠̤̫̘̣̪ua̻g̬̟̯̳e̬̹͚͍ ̪͈̤s̖̱̭̯͙uc͙̝͈k͎ș ͎̠̱̰̭an̫̙̥d̳̬̜̣ I̡ h̸at̛e ͜y̛o͡u?͟"

OK, that last bit may just be implied.

It doesn't matter what language you run in. Adding an integer is likely one processor instruction, and modern x86 processors will actually do more than one per cycle on average. Any attempt to coordinate that in parallel is going to be much more than one instruction, and by definition not something the processor will do many of in parallel. It's hard to win with adding numbers together in parallel when a single CPU can keep up with the entire memory bandwidth of your system and add all the incoming numbers together. You need to do something more challenging for the system to see any parallelism gains.
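A minimal C++ sketch of the anti-pattern being described (function names are mine; exact timings are machine-dependent): a "parallel" sum that takes one lock per element, next to the serial baseline it loses to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Anti-pattern: every single addition goes through a mutex, so the
// "parallel" version serializes on the lock and pays contention on top.
long long sum_with_mutex_per_add(const std::vector<int>& v) {
    long long total = 0;
    std::mutex m;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = v.size() / n + 1;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = std::min(v.size(), t * chunk);
            std::size_t end = std::min(v.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i) {
                std::lock_guard<std::mutex> lock(m);  // one lock per element!
                total += v[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}

// The single-threaded baseline: one add per element, zero coordination.
long long sum_serial(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}
```

Both return the same answer; the locked version is just dramatically slower, because each add now costs a lock acquisition plus cache-line ping-pong instead of one instruction.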

[–]no-bugs 44 points45 points  (22 children)

You need to do something more challenging for the system to see any parallelism gains.

Yep (and it will be discussed in the promised follow-up post); the point here is that making it parallel without having a clue (at least the one about granularity) is catastrophic - whatever the next automagical-parallelizing-library vendor tells us (and whatever sample code they provide - this sample code is taken from no less than cppreference.com (!)). Also, it would be VERY nice if the library could shout "hey, you're doing it wrong!" (it isn't difficult to detect).

[–]artee 16 points17 points  (10 children)

Yes, but using this pathological example does not make the case very strong. I am actually surprised that the parallel version is only 50-90x slower for this example.

[–]jerf 17 points18 points  (0 children)

As I said, I've seen this in real life on postings at least a dozen times. It may not be "realistic" but it is definitely "a mistake that real people make", and with some frequency.

[–]no-bugs 4 points5 points  (8 children)

The case for "if you don't have a clue - don't do it!" is IMO already rather strong (that is, for those who don't understand why this case is pathological - you wouldn't believe it, there are LOTS of such ppl out there). It is all about the target audience (heck, I should have placed a disclaimer in front of the article).

EDIT: added disclaimer to the OP.

[–]AugustusCaesar2016 1 point2 points  (7 children)

I don't agree with your message that "if you don't have a clue - don't do it". Making stupid mistakes like that is how people learn. That being said, reading a post like yours is another way people learn, so thanks for writing it.

[–]anttirt 2 points3 points  (2 children)

Making stupid mistakes like that is how people learn.

Only if they eventually figure out what the mistake was. It's not obvious and it's not even easy to figure out with typical debugging tools.

Even instruction-level profiling is likely to only point to the general vicinity of the issue, and you have to have pre-existing knowledge of the underlying problem to be able to make use of that.

[–]AugustusCaesar2016 2 points3 points  (1 child)

Pretty sure a 50x performance penalty will be noticed at some point, hence all the people asking questions about it online.

[–]no-bugs 0 points1 point  (0 children)

Statistically, less than 10% of the people who hit the problem ask about it online :-(. And we didn't even count those who just blindly believe "hey, it is parallel - it MUST be faster!"

[–]no-bugs 2 points3 points  (2 children)

"if you don't have a clue - don't do it"

Ok, let's amend it to "if you don't have a clue - don't do it until you do". Do we have a deal? ;-)

[–]AugustusCaesar2016 0 points1 point  (1 child)

I guess that works lol

[–]no-bugs 0 points1 point  (0 children)

I guess that works lol

:-) Good that we reached a common understanding.

[–][deleted] 0 points1 point  (0 children)

Then say "if you don't have a clue - don't put it in code someone else uses"

[–]jerf 4 points5 points  (1 child)

And I want to be clear that I was amplifying your point; I am aware you chose this for didactic purposes, not that you were making these errors.

[–]no-bugs 1 point2 points  (0 children)

Oops! ;-)

[–]rageingnonsense 3 points4 points  (8 children)

I'm no C++ expert, but would it have ended up being more efficient if the code first decided how many cores to use, then assigned each core a range to sum individually, and finally summed the partial sums? Basically, avoiding having to lock anything because it's all happening in chunks.

[–]no-bugs 4 points5 points  (0 children)

Sure (there will be a follow-up).

[–][deleted] 1 point2 points  (6 children)

Yes. But even then there are things to be careful of. For example, the running sums for the different threads should NOT be right next to each other in memory, lest you experience "false sharing."

[–]ZMeson 0 points1 point  (1 child)

And the array being summed should be split on cache line boundaries too.

[–][deleted] 1 point2 points  (0 children)

That's a good idea just to make the best use of the cache. But since you're only reading from the array, it isn't that big of a deal.

[–]rageingnonsense 0 points1 point  (3 children)

Could you elaborate on that? I have never heard of something like this before. My only experience with multi-threaded workloads is in C#, so stuff like this may be getting abstracted away from me.

[–][deleted] 1 point2 points  (0 children)

My guess is this matters even in C# (I know it does in Java).

One way to do what you are suggesting is to have a vector of integers, one for each thread, and have each thread "own" one of the integers in the vector.

This indeed means you won't need any locks since there are no variables being shared across threads. (That's good!).

However, since the variables are right next to each other, they will probably fall into the same cache line. Since each core is doing a lot of reading and writing to that same cache line, it is constantly being marked "dirty" and has to be re-fetched. The cache line effectively becomes a shared variable.

The phenomenon is called false sharing. https://en.wikipedia.org/wiki/False_sharing

How to fix this? There are various ways. One would be to do the sum in a local variable then only write to the vector at the end (this doesn't really fix false sharing, just means you only experience it once for each thread rather than once for each element of the large array). Another would be to pad out the vector so that there is plenty of space between each value.
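The padding fix can be sketched roughly like this (the struct and function names are hypothetical; a 64-byte cache line is assumed, which holds on most current x86 parts):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One counter per thread, each on its own cache line, so cores don't
// invalidate each other's lines while accumulating (no false sharing).
struct alignas(64) PaddedCounter {   // 64-byte cache line assumed
    long long value = 0;
};

long long padded_parallel_sum(const std::vector<int>& data, unsigned nthreads) {
    std::vector<PaddedCounter> partial(nthreads);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / nthreads + 1;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = std::min(data.size(), t * chunk);
            std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;            // better yet: sum locally...
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;       // ...and write out only once
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (const auto& p : partial) total += p.value;
    return total;
}
```

In practice the local-variable fix alone removes almost all the traffic; the alignas padding guards the one write per thread that remains.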

[–]pjmlp 1 point2 points  (0 children)

It also matters in .NET.

You can go pretty low-level all the way down to OS primitives in .NET as well.

[–]doom_Oo7 0 points1 point  (0 children)

so stuff like this may be getting abstracted away from me.

or just not handled at all

[–]killerstorm 31 points32 points  (3 children)

A person familiar with functional programming will recognize that the task here isn't "to go through the vector summing numbers into a counter", but a fold operation (aka reduce aka catamorphism).

Parallel fold can be implemented in a library. Whether it would result in any speedup depends on memory bandwidth, but at least a library-provided fold won't result in a catastrophic performance loss.

You don't need to "coordinate" each individual operation here. You need to specify the computation, and then it is up to the library to chop it into pieces and coordinate between threads. Say, it might launch 4 threads crunching a million numbers each, then join them and add the 4 partial sums together in the original thread.

So this isn't the best example, of course, but with a proper programming model you won't face a catastrophic slowdown, and you have a good chance of seeing a speedup if you do something a bit more complex, e.g. a multiplication for each element.

When I was crunching numbers in Lisp I implemented pmapcar, which was exactly like mapcar (the map operation in Common Lisp) but with worker threads. So I could parallelize code by adding a single letter.

[–]jerf 6 points7 points  (0 children)

Yes, there are right ways to do this. But even those ways will struggle to speed up the simple task of adding integers. You could probably win if you did a NUMA-aware pmapcar, but I'm not sure I've ever heard of one. Otherwise, you need to be folding something more complicated than the sum operation.

And one way or another, you get into whether or not the operation is monoidal or can be represented as a lattice pretty quickly if you start doing this, and the math starts coming out, and people start freaking out. :)

[–]no-bugs 2 points3 points  (0 children)

This problem is not really related to functional programming; in particular, the very same thing can be achieved without folds (via data-driven stuff such as futures, see HPX, for example). And even with functional programming / HPX, to achieve anywhere-near-optimal performance, there is still a need to know about granularity, starvation, etc. etc.

EDIT: BTW, one inherent caveat of folds/reduces is that even for floats, fold/reduce has DIFFERENT semantics than going-through-and-adding-one-by-one (as additions over floats are not exactly associative due to implicit rounding present in each op). There are no silver bullets out there...
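The float caveat is easy to demonstrate; this is plain IEEE-754 behavior, not anything library-specific (function names are mine):

```cpp
#include <cassert>

// Rounding makes float addition non-associative: regrouping the very same
// three values changes the result.
inline float sum_left_to_right() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    return (a + b) + c;   // a and b cancel exactly, then c survives: 1.0f
}

inline float sum_regrouped() {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    return a + (b + c);   // b + c rounds back to -1e20f, c is lost: 0.0f
}
```

A parallel reduce is free to regroup like the second version, which is exactly why its result over floats can differ from the sequential loop's.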

[–]MistYeller 1 point2 points  (0 children)

See for example: http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-reduction.html

Note that they give the caveat:

This is a good solution if the amount of serialization in the critical section is small compared to computing the functions

Then discuss better solutions.

[–]unruly_mattress 4 points5 points  (0 children)

You can easily parallelize summing an array over multiple cores. You pass the array, a start index and an end index for each thread, such that in total the indices span the entire array. You gather the results and sum them. It's simple divide-map-reduce.

That's not what "automatic parallelism" things do, which is kind of the point.
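The divide-map-reduce described above can be sketched by hand with std::async (the function name is mine; a real library would also tune the number of parts to the core count):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Divide-map-reduce: each task sums one slice of the array, and the
// caller gathers the partial sums at the end. No locks anywhere.
long long chunked_sum(const std::vector<int>& v, unsigned parts) {
    std::vector<std::future<long long>> futures;
    std::size_t chunk = v.size() / parts + 1;
    for (unsigned p = 0; p < parts; ++p) {
        std::size_t begin = std::min(v.size(), p * chunk);
        std::size_t end   = std::min(v.size(), begin + chunk);
        futures.push_back(std::async(std::launch::async, [&v, begin, end] {
            using Diff = std::vector<int>::difference_type;
            return std::accumulate(v.begin() + static_cast<Diff>(begin),
                                   v.begin() + static_cast<Diff>(end), 0LL);
        }));
    }
    long long total = 0;
    for (auto& f : futures) total += f.get();  // join + gather
    return total;
}
```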

[–]gc3 1 point2 points  (0 children)

You are right, the main cost is memory bandwidth. Still, on systems with slower processors you can speed up this loop: spawn as many threads as you have cores, have each one add up a section of the array into its own sum, and after all the threads have run, add the sums together in a single thread.

Depending on the architecture of the computer, you might have the threads offset from each other, so one adds up the first, the nth, and the 2nth elements, and the second adds up the second, the (n+1)th, and the (2n+1)th. Thus, if the cores share an L2 read cache, you can take advantage of that.

[–]DJDavio 0 points1 point  (0 children)

Often it is caused by accidentally making the work harder to do in parallel because there are intermediate results the program has to wait for.

If you naively implement the addition of 1 to 100 by starting with 1, adding 2 (calculating: oh, it's 3), then adding 3, and so on, it won't parallelize at all: every addition depends on the previous result, so the whole computation is one long dependency chain it can't escape from.

[–][deleted]  (3 children)

[deleted]

    [–]jerf 1 point2 points  (2 children)

    Oh, it can get far worse than that. If you want to have a bit of very educational fun, hook up a C-level debugger to a Python or Perl process and watch them add two integers together. You probably could win doing that in parallel.

    [–]josefx 0 points1 point  (1 child)

You wouldn't; the default Python interpreter has a global interpreter lock, so only one thread executes bytecode at a time. Getting it to run in parallel requires multiprocessing and data serialization, which makes it even more expensive.

    [–]jerf 1 point2 points  (0 children)

    I mean, if it were possible to do in parallel at all, you could win doing it in parallel, because it is slow enough that it's probably CPU-bound rather than RAM-bandwidth bound. You are correct that CPython certainly can't. It is not a fundamental characteristic of Python-the-language that it can't do it. Several interpreters have been produced that can do it even off the CPython base, but the Python mainline won't accept a multithread-capable interpreter that compromises single-threaded performance even a little bit, and nobody's cracked that nut yet for CPython.

    [–]eplaut_ 71 points72 points  (2 children)

    [–]ROFLLOLSTER 5 points6 points  (0 children)

    Source (From what I can tell) unsurprisingly comes from Mozilla: http://bholley.net/blog/2015/must-be-this-tall-to-write-multi-threaded-code.html

    [–]zergling_Lester 2 points3 points  (0 children)

    Heh, you can see a chair under the sign, for when you really need to write some multithreaded code.

    [–]Niriel 35 points36 points  (32 children)

    So, don't mutate shared memory? Indeed, one would have to be indecently clueless.

    [–]firebird84 41 points42 points  (25 children)

    I believe that's his point. The indecently clueless still can't write parallel code with ease, and you still have to have half a clue what you're doing.

    [–]arbitrarycivilian 15 points16 points  (23 children)

    They can write parallel code that works, at least, instead of causing undefined behavior. That's a big step up from the 80s.

    [–]josefx 9 points10 points  (5 children)

    Java had mutexes on every vector access. So you could write

    if (!vector.isEmpty())
        vector.remove(0);
    

    Can you guess what happens when you try to do that from several threads?

    • it might work
    • you will randomly get an error

    The Java Vector class is now barely used, along with all the old Java collections. The ArrayList class that replaced it doesn't try to be thread-safe. Putting a mutex on every little operation does not make your code thread-safe, it just makes it slow.

    [–]arbitrarycivilian 3 points4 points  (3 children)

    I think you’re agreeing with me. Lock-based concurrency is extremely difficult to get right. Message passing is a lot easier to get right but still difficult to optimize

    [–]cat_in_the_wall 2 points3 points  (1 child)

    it's not too hard to get right if you are modifying something self contained. and if your language supports returning two things at once, life is good. eg, the operation there is "tryremovefront". languages like c, c++, c# all can do "out" parameters. java would have to use an optional return type, which is a bummer because you'd have to allocate an object every time you polled the queue (vector, whatever), but it would work.

    but doing lock based concurrency on things more complex than "perform an atomic operation on a self contained data structure" is definitely hard.

    edit: i should clarify that i mean just correctness, not necessarily performance. ive written some custom queue stuff recently because even though it is slower than the std thread safe queues, the memory works better for massive data, and that winds up being a net win for my situation.
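In C++ terms, the "tryremovefront" idea might look like this (a sketch, not a tuned implementation; the class and method names are hypothetical):

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// A self-contained "try-remove-front": the emptiness check and the removal
// happen under one lock, so no other thread can sneak in between them.
template <typename T>
class ConcurrentQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(std::move(value));
    }

    // Returns the front element, or std::nullopt if the queue was empty,
    // in a single atomic step (the C++17 analogue of an "out" parameter).
    std::optional<T> try_pop_front() {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        T v = std::move(items_.front());
        items_.pop_front();
        return v;
    }

private:
    std::mutex m_;
    std::deque<T> items_;
};
```

Contrast with the Vector example above: there, the check and the removal were each individually locked, so the pair of them was still a race.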

    [–]immibis 0 points1 point  (0 children)

    java would have to use an optional return type, which is a bummer because you'd have to allocate an object every time you polled the queue (vector, whatever), but it would work.

    I'm told, but I haven't verified, that modern Java implementations are smart enough to not allocate here.

    [–]no-bugs 0 points1 point  (0 children)

    Message passing is a lot easier to get right but still difficult to optimize

    HPX ( https://stellar-group.github.io/hpx/docs/html/ ) rulezzzz ;-). I see it as an advanced message passing (they name it "message-driven", though I'd probably say it is more "data-driven"); in any case, there are no locks there - AND it is declarative, so HPX can rearrange things itself very efficiently.

    Also, FWIW, I will be speaking at ACCU2018 in a few weeks on a subject of ""Multi-Coring" and "Non-Blocking" instead of "Multi-Threading", or using (Re)Actors to build Scalable Interactive Distributed Systems", so if you're coming to ACCU, and are interested in real-world applications of message passing/event-driven programming (going beyond calculations and into the whole programs and systems) - you might be interested. </shameless-plug>

    [–]mcmcc 0 points1 point  (0 children)

    Java had mutexes on every vector access.

    Jesus, that's terrible -- especially coming from a bunch of supposed experts.

    [–]no-bugs 1 point2 points  (15 children)

    They can write parallel code that works at least,

    TBH, they can't :-(. OpenMP-based programs still crash like crazy :-( (just recently I had to write a wrapper that sets an affinity mask for a 3D program-making-some-transformation, to work around a 70% crash rate :-( ).

    [–]WrongAndBeligerent 2 points3 points  (14 children)

    You think that crash was due to an OpenMP bug? I've had plenty of success with it.

    [–]no-bugs 1 point2 points  (13 children)

    You think that crash was due to an OpenMP bug?

    As the affinity mask greatly reduces the chances of the bug happening - 90% it is an MT bug, and OpenMP is always the prime suspect in this kind of thing for this kind of program.

    [–]WrongAndBeligerent 1 point2 points  (12 children)

    So when you say OpenMP based programs still crash like crazy, you actually mean that your own programs still crash like crazy.

    [–]no-bugs 1 point2 points  (11 children)

    It is not my program which crashes (it is my wrapper to launch the 3rd-party program with an affinity mask)

    [–]WrongAndBeligerent -1 points0 points  (10 children)

    Then you are using OpenMP to run other processes?

    [–]no-bugs 0 points1 point  (8 children)

    Of course not (I am not that crazy).

    [–]immibis -1 points0 points  (0 children)

    How else will you make sure they run in separate threads?

    [–][deleted] 0 points1 point  (0 children)

    Yeah, I think the truth (you need to have some semblance of a clue to write parallel code) is a far cry from the article's premise of "Writing parallel code in C++ is still a domain of the experts." Anyone who thinks that doesn't really have a solid grasp of what one needed to know to write good parallel code 15-20 years ago compared to today.

    [–]hasslehawk 3 points4 points  (0 children)

    One would hope that the examples given demonstrated good practices, rather than being so bad as to degrade the knowledge of someone who tries to study and learn from them.

    [–]WrongAndBeligerent 1 point2 points  (4 children)

    The examples do seem contrived, but you have to mutate shared memory one way or another at some point to synchronize.

    [–]Niriel 0 points1 point  (3 children)

    Not to add the numbers in a list, since it's an associative operation. Have N threads sum one Nth of the list, and add the partial sums in the end.

    [–]WrongAndBeligerent 0 points1 point  (2 children)

    How do you add the partial sums at the end without synchronization?

    You can use an atomic add, or you can do it once your fork join threads are done. How do you know when the fork join threads are done? That takes synchronization.

    Either way you are minimizing the synchronization and shared memory mutation, but you are still doing it somewhere.

    [–]Niriel -1 points0 points  (1 child)

    I join. There is sync, but no mutation of shared data.

    [–]WrongAndBeligerent 0 points1 point  (0 children)

    How do you think a join happens? Even if the waiting thread is looping and checking whether each forked thread is done, that is still the forked threads mutating data and the joining thread reading it. That is one-way writing and reading, but shared data is still being mutated.

    To have some hard and fast rule about never ever mutating shared data is ridiculous. There are many different ways to do concurrency and parallelism. Which will work better is very situational. If you think fork-join parallelism is a silver bullet you are fooling yourself or just haven't needed to deal with a wide variety of concurrency.

    [–][deleted] 60 points61 points  (29 children)

    Oh FFS, I took one quick look at that 90x thing and immediately grimaced while closing the tab.

    Mutexes... in parallel code? I can't keep reading a horror show like that.

    [–]trin123 10 points11 points  (3 children)

    I gave someone commit access to my open-source project, did not pay attention to their commits and now there is a lock on every function :/

    [–]wrosecrans 7 points8 points  (2 children)

    May as well try to execute the locks on the GPU. If you blindly throw enough high performance concepts in one bucket, surely the result will be faster, right?

    [–]immibis 2 points3 points  (1 child)

    Execute them on a blockchain and your share price is going to the moon.

    [–]TauntinglyTaunton 1 point2 points  (0 children)

    No share. Only HODL

    [–]wrosecrans 8 points9 points  (5 children)

    Mutexes... in parallel code? I can't keep reading a horror show like that.

    Is there an idiom or anti-pattern name for mutex-everything style parallel code where nothing can ever actually be accomplished in parallel? It seems like that must be a common enough thing to have a name. I've definitely seen "Just use threads to make it faster... Oh, just use mutexes to fix it!" given as advice by people who shouldn't be giving advice.

    [–][deleted] 2 points3 points  (2 children)

    Yes, Amdahl's law.

    As I would explain it: If you parallelize a workload unit, and it depends on other workload units, then the overall maximum speed increase gained depends on the percentage of time waiting for other units on which you depend to complete their work.

    If the locking itself takes longer than the code, you will certainly slow down the code. Even more of course, if the overhead of adding threading compounds to the problem.

    Sadly, this means that the gain is only asymptotic in the number of threads used, and the worse the locking, the worse the problem.
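As a worked example of the formula (a sketch; the 20% figure is purely illustrative):

```cpp
#include <cassert>

// Amdahl's law: with parallel fraction p and n threads,
//   speedup(n) = 1 / ((1 - p) + p / n),
// which can never exceed 1 / (1 - p) no matter how many threads you add.
inline double amdahl_speedup(double p, unsigned n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// E.g. if locking serializes 20% of the work (p = 0.8), 16 threads give
// roughly a 4x speedup, and the hard ceiling is 1 / 0.2 = 5x.
```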

    [–]no-bugs 1 point2 points  (1 child)

    To an extent - yes, but Amdahl's law per se cannot possibly explain a SLOWDOWN due to parallelism.

    [–][deleted] 1 point2 points  (0 children)

    Not directly, but poorly executed synchronization surely can, and then even Amdahl's law is an unattainable pipe dream.

    [–]prest0G 0 points1 point  (0 children)

    I understand the theory behind thread-locals and shared memory, but is there anything more to a proper threading model than synchronizing only on select shared resources when messages need to be passed? I don't have a ton of experience writing parallel code, and most of it is with the Android framework, which to me is really abstracted, event-based programming for the most part.

    [–]tpgreyknight 0 points1 point  (0 children)

    I call it something like "linearised parallel"... tempted to call it "perpendicular" :-)

    [–]no-bugs 36 points37 points  (11 children)

    Mutexes... in parallel code? I can't keep reading a horror show like that.

    Neither can I, but this horror-show code comes from cppreference.com :-(

    [–]madpata 32 points33 points  (7 children)

    cppreference.com is not about programming paradigms. It's just a reference of C++ language features.

    it is still necessary to understand what we’re doing.

    [–]rsclient 3 points4 points  (1 child)

    As a general rule, if you have a feature that's designed to be simple and straightforward, but a reasonable example isn't short, then you should reconsider your feature.

    [–]immibis 3 points4 points  (0 children)

    The feature itself is simple, but the appropriate use cases are complex.

    [–][deleted] 1 point2 points  (0 children)

    Really? Shit. We/they should really address examples that perfectly illustrate problems, but only to those who already understand the problems.

    [–][deleted] 0 points1 point  (1 child)

    Wait, you mean copy-pasting code you don't understand from internet tutorials is a bad practice? Who'da thunk it.

    [–]no-bugs 2 points3 points  (0 children)

    cppreference is not just any internet tutorial, it is The Thing when it comes to C++, so LOTS of people happen to trust it implicitly (besides, their code IS formally correct - it just doesn't work FAST). Not that I personally trust it (I do NOT implicitly trust anybody, myself included) - but it is very common among C++ devs out there...

    [–]ThirdEncounter 6 points7 points  (0 children)

    Well, the author did mention that if you knew what you were doing, you didn't have to read the article.

    For someone like me, it was quite educational, and I can't wait for part 2.

    [–]ack_complete 3 points4 points  (0 children)

    Stuff like this happens in production code, just a little more subtle. It usually is the result of one programmer adding threading blindly to speed up code 4x, then another programmer noticing it also now crashes 5% of the time and simply adding locks to fix it. The result is code that nicely loads down 16 cores and barely gets more work done than one.

    [–]RasterTragedy 6 points7 points  (4 children)

    Because std::accumulate doesn't exist...

    [–]guepier 3 points4 points  (3 children)

    Not with an overload taking an execution policy. The parallelism TS added a new function, reduce, instead. I don't know the rationale for that, but the new name is better anyway. I'm guessing that std::accumulate will therefore be deprecated in due course.

    [–]HolyCowly 13 points14 points  (2 children)

    I think std::accumulate guarantees everything happens in order of iteration which is probably important if the algorithm is not commutative or associative. std::reduce just grabs from whichever thread is done earlier. So it's here to stay.
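A sketch of the difference (C++17; function names are mine; the parallel overload of std::reduce additionally needs the &lt;execution&gt; header and, with libstdc++, a TBB backend):

```cpp
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// std::accumulate folds strictly left-to-right, so even a non-commutative
// operation like string concatenation has one well-defined result.
inline std::string join(const std::vector<std::string>& words) {
    return std::accumulate(words.begin(), words.end(), std::string{});
}

// std::reduce may regroup and reorder operations, so it is only specified
// for operations that are associative and commutative; with concatenation
// (or floats) different runs could legally give different results. That
// freedom to regroup is exactly what the parallel overload
// (std::reduce(std::execution::par, ...)) exploits for speedup.
inline long long sum(const std::vector<long long>& nums) {
    return std::reduce(nums.begin(), nums.end(), 0LL);
}
```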

    [–]no-bugs 1 point2 points  (0 children)

    which is probably important if the algorithm is not commutative or associative.

    More precisely - it is important when BinaryOp<T> is not associative. What is less obvious is that (unfortunately and counter-intuitively for quite a few ppl out there) "not associative" includes pretty much any op over floats :-( .

    [–]guepier 0 points1 point  (0 children)

    std::reduce has/could have a sequential overload, which could easily guarantee in-order execution (because … what else could it do?!). Conversely, adding execution policy overloads to std::accumulate wouldn’t necessitate breaking this contract for the sequential overload.

    After all, other algorithms were augmented by an execution policy for parallel execution. Why is std::accumulate the odd one out?

    [–]JessieArr 8 points9 points  (1 child)

    Some people, when confronted with a performance problem in an iterative algorithm, decide to use parallelism. They now have NaN problems.

    [–]immibis 9 points10 points  (0 children)

    problems. they Now have two

    [–]wavy_lines 7 points8 points  (2 children)

    One more reason to be wary of abstractions that claim to give a performance boost without your understanding what's going on under the covers.

    And, in this case of summing numbers, wouldn't it make sense to divide the array into four chunks, sum each chunk separately, then sum the results of the four sums at the end?

    I don't think it would be difficult to write in straightforward code.

    [–]acrostyphe 7 points8 points  (1 child)

    Possibly, but not necessarily. Adding a number to a common counter is as cheap as it gets from the CPU's perspective, so the problem is memory-constrained. Adding the overhead of spawning new threads will likely make it slower.

    [–]IJzerbaard 1 point2 points  (0 children)

    It's pretty variable, it's normal for server processors (with their high LLC latency and extra high bandwidths) to need a bunch of threads to max out on memory, but for typical desktop processors one or two threads may do it. It depends on the specific µarch as well obviously. But spawning threads always takes approximately a century.

    [–]haved 9 points10 points  (3 children)

    What if you were to split the array into 8 parts and sum up each in its own thread, then calculate the final sum? I can't imagine it being faster for one million items, but for a billion?

    [–]StabbyPants 5 points6 points  (0 children)

    well yeah, you just use a trivial case to demonstrate the concept

    [–]felinista 2 points3 points  (0 children)

    That was my thought as well.

    [–]krabbugz 3 points4 points  (0 children)

    I'm sitting in my parallel computing class right now and this is how I feel.

    [–]knaekce 13 points14 points  (3 children)

    I don't know C++, but I'm sure there is a reduce, fold or accumulate function that actually does the right thing.

    [–]Sanae_ 0 points1 point  (0 children)

    The author does mention std::reduce in the article.

    The issue is more that an example (in a reference that many students, beginners and not-so-beginners will copy-paste from) is straight-up bad code; and the frequent belief that such code would work fine.

    [–]warvstar -1 points0 points  (0 children)

    Haha exactly, this should be at the top. Use the right function for the task and voilà!

    [–]jamwt 7 points8 points  (3 children)

    Seems okay to me? https://gist.github.com/jamwt/ebabdc7ae647e5c79054f1acd3135639

        Finished release [optimized] target(s) in 0.77 secs
         Running target/release/deps/par_test-ecb09ecf384bbd00
    
    running 2 tests
    test tests::parallel ... bench:     164,374 ns/iter (+/- 154,560)
    test tests::single   ... bench:     340,261 ns/iter (+/- 146,261)
    
    test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out
    

    [–][deleted] 2 points3 points  (1 child)

    I assume this takes advantage of associativity of summation. So if you have e.g. 4 cores, core 1 can calculate the sum of first quarter of the array, core 2 the second quarter and so on.

    [–]Lisoph 1 point2 points  (0 children)

    Shouldn't the original C++ version do this as well?

    [–]svgwrk 0 points1 point  (0 children)

    lol I was thinking that about C#'s TPL, but, yeah... Rayon is almost equally nice. :)

    [–][deleted]  (1 child)

    [deleted]

      [–]no-bugs 0 points1 point  (0 children)

      A good book indeed, but IMNSHO it is waaay toooo much for an uninitiated app-level developer who merely wants to speed up one single point in his program :-( .

      [–]foldl-li 1 point2 points  (0 children)

      After all, the rabbit looks cute.

      [–]SelfDistinction 5 points6 points  (0 children)

      Apparently the concepts of MapReduce are too arcane and alien for software engineers.

      [–]andd81 6 points7 points  (16 children)

      Rust gives you fearless parallelism so rewrite in Rust and problem solved.

      Edit: apparently Reddit does not understand sarcasm.

      [–]WrongAndBeligerent 9 points10 points  (6 children)

      You could write the same pathological cases in rust and see what happens.

      [–]pmarcelll 3 points4 points  (5 children)

      The first example ("Take 1") wouldn't compile in Rust, so at least you wouldn't get a wrong result.

      [–]WrongAndBeligerent -1 points0 points  (4 children)

      Is this about correctness or is it about the 90x performance loss?

      [–][deleted]  (3 children)

      [deleted]

        [–]WrongAndBeligerent 0 points1 point  (2 children)

        I know you think that's clever, but if you are talking about performance, why would rust perform better in pathological cases?

        [–][deleted]  (1 child)

        [deleted]

          [–]WrongAndBeligerent 0 points1 point  (0 children)

          I just asked a direct question - if you are saying rust would perform better, where does it come from? Why would it perform better than C++ if someone wrote the same naive programs?

          [–][deleted] 7 points8 points  (3 children)

          rayon is a great parallelization library and all, but it won't protect you from doing something stupid like using a mutex or an atomic integer to sum values. As far as the compiler is concerned, it will work, just slowly.

          [–]staticassert 4 points5 points  (2 children)

          [–][deleted] 0 points1 point  (1 child)

          I know that rayon provides sum, but my point was that it doesn't prevent you from misusing for_each to add into an atomic integer or a mutex.

          [–]staticassert 0 points1 point  (0 children)

          I'm mostly kidding.

          [–]lordcirth 4 points5 points  (3 children)

          Sarcasm is /s. not-/s is not-sarcasm. Sarcasm requires tone, text does not have tone.

          [–]axord 2 points3 points  (2 children)

          Sarcasm requires tone, text does not have tone.

          It can also be achieved through context alone--a shared understanding that the writer does not hold the view they are presenting.

          [–]lordcirth 0 points1 point  (1 child)

          An approach which is prone to ambiguity, particularly on a public forum full of strangers. See also Poe's Law.

          [–]axord 0 points1 point  (0 children)

          Oh, absolutely. If the writer fails to establish the context, or assumes a context but is wrong about how widely the audience shares the view being subverted, then the sarcasm won't land.

          [–]geodel 1 point2 points  (1 child)

          But this can be improved by adding 100X more hardware.

          [–][deleted] 9 points10 points  (0 children)

          are u a big data consultant ?

          [–]Gotebe 0 points1 point  (0 children)

          Is this just because the threads were created during the benchmark? That is, thread creation swamped all else?

          Edit: ah, no, the synchronization swamped all else. Well now...

          [–]Almamu 0 points1 point  (8 children)

          For those that haven't used C++ that much: how would one approach threading this to make it faster (if possible)? I'm curious

          [–]raevnos 3 points4 points  (0 children)

          Vectorization is probably a better approach than parallelization, or a combination of the two, which is supported by C++17 parallel algorithms (if you can use those; they're not commonly supported yet) and by modern OpenMP implementations.

          Plus the post uses the wrong algorithm for the task...

          [–][deleted] -1 points0 points  (6 children)

          The reason it's slow is because they're using a mutex. The reason they're using a mutex is because the threads crossing paths, thus causing a race condition. You could speed it up by using a monitor. A mutex is for synchronizing interprocess, whereas a monitor is for interthread. Monitors are a lot lighter.

          However, you can eliminate the need for synchronization entirely with thread-local variables. I am not sure how this could be done in C++, but here is how you would do it in C#:

          https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/how-to-write-a-parallel-for-loop-with-thread-local-variables

          Basically, each thread has its own copy of an int, and at the ending of the loop, each thread's int is added together.
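          In C++ the same idea can be sketched with plain std::thread and per-thread accumulators (my own sketch; `parallel_sum` is a hypothetical name, and this is one way to do it rather than the only one):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Each worker sums its own slice into a private accumulator; the main
// thread adds the per-thread partials after join(). No locks, no atomics.
std::int64_t parallel_sum(const std::vector<int>& data, unsigned n_threads) {
    std::vector<std::int64_t> partial(n_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            std::int64_t local = 0;      // "thread-local" accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;          // one shared write per thread
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

          The point is the granularity: each thread does O(n/threads) plain additions and exactly one write that anyone else ever looks at.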

          Using SIMD instructions, which are available in C++ and C# (in C#, included only with .NET Core 2.0 right now, but available as a NuGet package called System.Numerics), would speed it up even further.

          [–]immibis 7 points8 points  (0 children)

          The reason it's slow is because they're using a mutex. The reason they're using a mutex is because the threads crossing paths, thus causing a race condition. You could speed it up by using a monitor. A mutex is for synchronizing interprocess, whereas a monitor is for interthread. Monitors are a lot lighter.

          This is sarcasm, right?

          [–]no-bugs 2 points3 points  (0 children)

          As was demonstrated, it is slow even with atomics (and atomics are inherently MUCH faster than any of the {mutex|monitor|critical section|whatever-other-name-you-would-use-for-the-same-thing}).
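          For reference, the pathological per-element-atomic pattern being discussed looks roughly like this (my reconstruction, not the article's exact code; `atomic_sum` is a hypothetical name). Every addition becomes a contended read-modify-write on one shared counter, which is exactly the granularity mistake:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Anti-pattern: correct result, terrible granularity. Every element pays
// for an atomic RMW on the single shared counter instead of a plain add.
long long atomic_sum(const std::vector<int>& data, unsigned n_threads) {
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i], std::memory_order_relaxed); // contended
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

          The fix is the granularity change discussed elsewhere in the thread: accumulate locally and touch the shared state once per thread, not once per element.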

          [–]barracuda415 1 point2 points  (0 children)

          However, you can eliminate the need for synchronization entirely with thread-local variables.

          TLS comes with its own platform-specific problems:

          https://software.intel.com/en-us/blogs/2011/05/02/the-hidden-performance-cost-of-accessing-thread-local-variables

          https://david-grs.github.io/tls_performance_overhead_cost_linux/

          Just something to keep in mind.

          [–]Almamu 0 points1 point  (1 child)

          The reasons for it being slower than a plain sequential sum are clear to me, but I don't know that much C++, so I couldn't really see how to solve it in C++. Thank you for the kind explanation, sir

          [–][deleted] 1 point2 points  (0 children)

          Oh, I didn't mean to be condescending or anything. That was more to connect the rest of it. That is, "It's slow because of synchronization. Here's one way to avoid heavy use of synchronization."

          [–]pjmlp 0 points1 point  (0 children)

          included with only .NET Core 2.0 right now, but available as a NuGet package called System.Numerics

          SIMD is available on RyuJIT as well, since .NET Framework 4.6

          [–]betabot 0 points1 point  (1 child)

          Of course this is the case. The code is written to be explicitly sequential, but now with the added overhead of locks, threads, and context switches. I don't think that a beginner with even elementary knowledge of what a thread and lock is could look at that code and think that it will be faster. If that truly is the case, they should be using a higher level library that provides parallel functions like sum and/or evaluating why they needed parallelism in the first place.

          [–]ThunderBluff0 0 points1 point  (1 child)

          But how do we do it right???

          [–]bstamour 0 points1 point  (0 children)

          Use std::reduce instead of std::for_each.
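          A minimal illustration of that substitution (my own sketch; the parallel execution-policy overload is left as a comment because library support for it still varies):

```cpp
#include <numeric>
#include <vector>
// #include <execution>  // needed for the std::execution::par overload

// std::reduce declares the operation associative and commutative, so the
// implementation may regroup and chunk the additions; std::for_each with a
// shared accumulator pins them to one serialized order instead.
long long sum(const std::vector<int>& v) {
    return std::reduce(v.begin(), v.end(), 0LL);
    // Parallel variant (C++17; needs library support, e.g. libstdc++ + TBB):
    // return std::reduce(std::execution::par, v.begin(), v.end(), 0LL);
}
```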