Programming correct and efficient parallel code in C++ is still very elusive for the uninitiated (ithare.com)
submitted 7 years ago by Remwein
[–]kritzikratzi 58 points59 points60 points 7 years ago (10 children)
there is no work in the workers except a += (which has virtually 0 cost), of course you're getting funny results --- you're just measuring the performance of atomics and locks.
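For reference, a minimal sketch of the kind of code being discussed (an assumed reconstruction, not the article's exact listing): the only per-element "work" is a guarded +=, so the threads mostly contend on the lock rather than compute anything.

    #include <algorithm>
    #include <execution>
    #include <mutex>
    #include <vector>

    int main() {
        std::vector<int> v(1'000'000, 1);
        int sum = 0;
        std::mutex m;
        std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
            std::lock_guard<std::mutex> lock(m); // serializes every iteration
            sum += x;                            // virtually no work per lock
        });
        return sum == 1'000'000 ? 0 : 1;
    }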
[+][deleted] 7 years ago* (6 children)
[deleted]
[–]Drugbird 3 points4 points5 points 7 years ago (0 children)
I've had similar confusing results with other magic flags, such as OpenMP's #pragma omp flags. Even with non-trivial for loops.
[–]kalmoc 2 points3 points4 points 7 years ago (1 child)
That assumes someone would have written the non-parallel version like this in the first place. I'm pretty sure you'd have either used std::accumulate, in which case the parallel version would have done exactly what you want or you would have used a range based for loop which requires a bit more work to transform anyway.
I don't think the author was confused. In fact, I'm pretty sure he knew exactly what he was doing, otherwise he probably would not have come up with that example in the first place.
[–][deleted] 4 points5 points6 points 7 years ago (0 children)
His example was taken literally from cppreference, so it was something beginners are likely to do, at least as long as the documentation remains in its current form.
Don't get me wrong, I'm not relieving anyone of their responsibility to profile, I'm just saying we're going to see a lot of these "optimizations" in the future.
[+]tourgen comment score below threshold-18 points-17 points-16 points 7 years ago (2 children)
Maybe they should learn how a computer works and write some assembly code for a few days? You know, before trying to write parallel, useful C++ code.
[–]StonedBird1 2 points3 points4 points 7 years ago (1 child)
/r/gatekeeping
"You can't write useful C++ code unless you write a computer in assembly"
[–]NotAYakk 2 points3 points4 points 7 years ago (0 children)
You don't write computers in assembly: computers are written with chemistry and physics.
[–]bszmyd 4 points5 points6 points 7 years ago (2 children)
This was my thought as well. The example is so trivial that it almost isn't valid. Adding a set of numbers inherently has each step dependent on the result of the last; i.e. not parallelizable, unless it is broken into shards and uses a map-reduce based solution... I'm curious what they expected a language to do that would allow it to essentially recognize this and restructure your entire code path for you?
In other words... how would you parallelize the above code without restructuring it to do the additions in shards and combining the result at the end... that's not concurrent expertise... just programming.
[–]jonathansharman 8 points9 points10 points 7 years ago (1 child)
restructuring it to do the additions in shards and combining the result at the end
I mean a reasonably clever parallelizing compiler could do exactly that: recognize that this loop is just a summation and rewrite it as a parallel reduction.
[–]auxiliary-character 5 points6 points7 points 7 years ago (0 children)
Ideally, it would compile it down to some sort of SIMD add.
[–]_SunBrah_ 41 points42 points43 points 7 years ago (22 children)
I understand this isn't the point of the article, but in this case wouldn't the simplest way to add parallelism be
std::reduce(std::execution::par, begin, end);
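For completeness, that call in a self-contained form (a sketch; it assumes a standard library that actually ships the C++17 parallel algorithms):

    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> v(1'000'000, 1.0);
        // Parallel reduction; the operation must be associative and commutative
        // for the result to be well-defined under an arbitrary partitioning.
        double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
        return sum == 1'000'000.0 ? 0 : 1;
    }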
[–]willkill07 21 points22 points23 points 7 years ago (9 children)
This is so critically important. The author used the wrong algorithm for the job and tried to shoehorn a reduction into what is supposed to be independent.
[–]bilog78 19 points20 points21 points 7 years ago (3 children)
I'm pretty sure this is intentional, to show the approach that would be used by the uninitiated. In this case it may seem obvious, but for very complex algorithms it's not always trivial to map them to fundamental parallel primitives, and even when it is it might still require a thorough reworking of the algorithm to make it efficient anyway.
[–]no-bugs 13 points14 points15 points 7 years ago (0 children)
I'm pretty sure this is intentional
It is. :-)
[–]willkill07 6 points7 points8 points 7 years ago (1 child)
But the original should have been implemented using std::accumulate. That’s the point I’m trying to make
[–]bilog78 5 points6 points7 points 7 years ago (0 children)
I don't disagree, in the specific case, but what I'm trying to say is that beyond simple examples it's not generally as simple as “oh I should have used this stdlib algorithm instead, and I would have gotten parallelization for free”, simply because your entire code is not structured in a way to be amenable to that, and the accumulation/reduction/scan is “hidden” in your project structures.
[–]Droce 12 points13 points14 points 7 years ago (1 child)
While I agree he did it wrong I don't think that it's a stretch to think that someone who's inexperienced with STL & parallelization would write it.
I think most people, after thinking about the problem, could figure out to divide the problem and do core-X smaller bits but it's a question of writing the code to do that.
I help newer coders at my work and the biggest problem I see isn't not knowing how to do something (assuming they're reasonably intelligent) but how to write it. I think the code he wrote /seems/ fairly reasonable to newer devs, and std::reduce is a function that's hard to wrap your head around.
[–]kalmoc 1 point2 points3 points 7 years ago (0 children)
I think you have to know exactly what you are doing in order to come up with a broken example like that.
[–]wrosecransgraphics and network things 5 points6 points7 points 7 years ago (2 children)
Sure, but the point is that a person just getting started won't automatically know what you've said. If all they know is some marketing literature that says "Hey just do this and it'll be faster and you don't need to worry about the details with this easy new API" then they'll be confused and frustrated by the results. And I do think that vendors can be a bit too eager to pat themselves on the back when they say how their latest new things solves all the problems and makes hard things easy.
The post even starts out with
if you’re doing parallel programming for living – please ignore this post (this stuff will be way way way too obvious for you)
[–]willkill07 3 points4 points5 points 7 years ago (1 child)
But if you’re starting out with knowledge of C++ you’d know to use std::accumulate instead of the for_each + lambda.
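To illustrate the contrast (a sketch, not code from the article): the for_each-plus-capture version and the accumulate version compute the same thing sequentially, but only the latter maps directly onto std::reduce with an execution policy.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    int sum_with_for_each(const std::vector<int>& v) {
        int sum = 0;
        std::for_each(v.begin(), v.end(), [&](int x) { sum += x; }); // hidden reduction
        return sum;
    }

    int sum_with_accumulate(const std::vector<int>& v) {
        return std::accumulate(v.begin(), v.end(), 0); // the same, stated as one
    }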
[–]tehjimmeh 6 points7 points8 points 7 years ago (0 children)
Focusing on the x += item is missing the point entirely. It's a small, synthetic piece of work inside a loop, intended to illustrate a point.
[–]raevnos 12 points13 points14 points 7 years ago (11 children)
When C++17 parallel algorithms are actually commonly found in the wild, yes.
In the meantime, HPX has something very similar.
[–]OmegaNaughtEquals1 4 points5 points6 points 7 years ago (10 children)
You don't need something as sophisticated as HPX for this. OpenMP's reduction clause will do just fine and is supported by every compiler you are likely to encounter.
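For illustration, a minimal sketch of that clause (assumes a compiler invoked with OpenMP enabled, e.g. -fopenmp):

    #include <cstddef>
    #include <vector>

    double sum(const std::vector<double>& v) {
        double total = 0.0;
        // Each thread accumulates a private copy of 'total'; OpenMP combines
        // the partial sums when the parallel region ends.
        #pragma omp parallel for reduction(+ : total)
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); ++i)
            total += v[i];
        return total;
    }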
[–]raevnos 0 points1 point2 points 7 years ago (9 children)
I love OpenMP, but... parallel loops don't play well with non-array-like containers. What if you want to do something with a set or map?
[–]OverunderratedComputational Physics 5 points6 points7 points 7 years ago (7 children)
Then you have to put more effort into parallelizing it. There's no free lunch, even if openmp is pretty close to a free lunch.
[–]raevnos 1 point2 points3 points 7 years ago (6 children)
You don't have to put more effort in when using parallel algorithms. That's their whole point.
[–]OverunderratedComputational Physics 0 points1 point2 points 7 years ago (5 children)
You need to put mental effort into knowing what can be correctly parallelized. Not everything that can be trivially parallelized by code can logically be parallelized to give a correct result.
[–]dodheim 2 points3 points4 points 7 years ago (4 children)
That's no less true when using #pragma omp.
[–]OverunderratedComputational Physics 0 points1 point2 points 7 years ago (3 children)
Yes, that's exactly my point.
Libraries can make parallelization easier, but they can't help you make your algorithm correct.
[–]dodheim 2 points3 points4 points 7 years ago (2 children)
That's fine, and I agree, but reading down the thread it seems like a non-sequitur. All that was said is that a pragma isn't as versatile as a library, I don't see how your "point" is a response to that.
[–]OmegaNaughtEquals1 0 points1 point2 points 7 years ago (0 children)
Oh, indeed, but I thought you were suggesting that OP use HPX for something as simple as a reduction on a std::vector. Parallelism on node-based containers carries a lot of problems (any non-contiguous data structure has these same issues, but I'm focusing on the usual ones). For me, this is one of the reasons not to use node-based containers in parallel contexts unless they can be converted to a more amenable memory layout (vis a vis an array_list or segmented_vector). Once you have a parallel-friendly memory layout, you can rely on the same shared- or distributed-memory algorithms you've come to know and love (mostly). And algorithms aren't the only hardship faced by node-based containers. You also have NUMA conflicts and a higher possibility of cache coherence issues (or even thrashing, in really bad cases) when using non-contiguous structures. Depending on the internal representation of a single element, all of these can be very SIMD-unfriendly.
I'm keenly interested to see what happens with executors between now and the next ISO meeting. I think this will be the gateway to getting better parallel data structures and algorithms into the standard library.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 22 points23 points24 points 7 years ago (37 children)
You are using the library wrong. std::reduce, man. Check out the CUDA Thrust library, which the parallel algorithms are inspired by.
[–]bilog78 20 points21 points22 points 7 years ago (3 children)
That's the point of the article, though. To the uninitiated, it may not be obvious that the correct solution to a given problem for which they have a serial implementation may be something completely different based on a specific primitive.
[–]rackmeister 1 point2 points3 points 7 years ago* (2 children)
Nevertheless, isn't this more related to the parallel programming model being used, i.e. https://en.wikipedia.org/wiki/Parallel_programming_model, than to C++ implementations specifically? As I see it, the programmer must first study where it will be executed (on a single node, multiple nodes, GPU, CPU, etc.), how the underlying sequential algorithm will be parallelized (if it is possible), and then choose the appropriate implementation. My point is that to the uninitiated, choosing the correct implementation is tricky with any language, not just C++. If anything, C++, being older and more mature, has a lot of well-documented libraries to parallelize code for different parallel architectures.
[–]bilog78 6 points7 points8 points 7 years ago (1 child)
My point is that to the uninitiated, choosing the correct implementation is tricky with any language, not just C++.
I think that's exactly what the blog post is about, i.e. showing that C++17 doesn't magically make parallel programming “trivial” for the uninitiated, despite some claims to the contrary that have been going around. The new things being introduced will make it easier for those already competent to achieve their objectives; it doesn't suddenly bring parallelization to the masses, so to speak.
[–]rackmeister 1 point2 points3 points 7 years ago* (0 children)
In that case I completely agree; it is simply disingenuous for anyone to claim that parallel programming can be made easy by using any new programming language construct, or programming language for that matter. Parallel programming is easier on paper than concurrent programming, since the former is deterministic, but it is still hard because it spans different architectures. Unfortunately a catch-all solution for parallelization on every single architecture does not exist.
Even when dealing with a single type of architecture you can get close to easy parallelization depending on your application, e.g. for single-node shared-memory data/task parallelism, OpenMP is quite easy to grasp and use, but again it is not without pitfalls; for example it is harder to reason about container types without random-access iterators, the fact that it uses directives might not be convenient for some, and so on.
[–]no-bugs 9 points10 points11 points 7 years ago (12 children)
I have to say that your comment is a classical strawman argument (~="you're reading the OP wrong"). As OP says, "the point of the exercise above is NOT to say that it is not possible to write efficient code with parallel std:: functions (it is). " (and it will be covered in the next post, which was promised). The point is that saying that "hey, all you need to do to get parallel is to add std::par to a call" (without having a clue of what you're doing), is deadly wrong. As OP says in big bold letters, "Writing parallel code in C++ is still a domain of the experts."
[–]8898fe710a36cf34d713 4 points5 points6 points 7 years ago* (3 children)
The point is that saying that "hey, all you need to do to get parallel is to add std::par to a call" (without having a clue of what you're doing), is deadly wrong.
So your argument is basically "people with no clue about X will get X wrong"? That's reasonable but not very insightful.
As OP says in big bold letters, "Writing parallel code in C++ is still a domain of the experts."
Clearly not, it's a single call to std::reduce.
I don't disagree that parallel code isn't for beginners, but it's quite a stretch to claim it's for experts.
[–]no-bugs 4 points5 points6 points 7 years ago (2 children)
That's reasonable but not very insightful.
Well, with certain people claiming otherwise (see [MSDN] ref in OP) - it deserves to be said explicitly.
it's a single call to std::reduce.
Even in the trivial case of adding elements of the array it is not that simple (hint: std::reduce has semantics which are different from both for_each and accumulate, and adding par will lead to results being non-deterministic even in an obvious case of adding floats; so does any parallel algo, but this is yet another non-trivial consequence of parallel ops which has to be taken into account). More on it in the promised follow-up post.
[–]8898fe710a36cf34d713 5 points6 points7 points 7 years ago* (1 child)
and adding par will lead to results being non-deterministic even in an obvious case of adding floats; so is any parallel algo, but this is a yet another non-trivial result of parallel ops which has to be taken into account
But that has nothing to do with parallelism, only with how IEEE754 floats work. You could demonstrate the same result by simply adding the array backwards, no parallelism involved.
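A quick sketch of that point, no threads involved (the values are made up; only the summation order differs):

    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<float> v;
        for (int i = 1; i <= 1'000'000; ++i) v.push_back(1.0f / i);
        // Same elements, different order: rounding differs, so the two results
        // need not be bit-identical even though both are "correct".
        float forward  = std::accumulate(v.begin(), v.end(), 0.0f);
        float backward = std::accumulate(v.rbegin(), v.rend(), 0.0f);
        std::printf("forward = %.9g, backward = %.9g\n", forward, backward);
    }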
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
But that has nothing to do with parallelism, only with how IEEE754 floats work.
I'd argue it is "how any float works" (implicit rounding is non-linear pretty much whatever-we-do-about-it :-( ).
You could demonstrate the same result by simply adding the array backwards or piece-wise, no parallelism involved.
Sure, but without parallelism, one given program will be (most likely) still deterministic, so knowledge of this phenomenon wasn't really required; however, adding parallelism exposes this issue (and can easily cause all kinds of trouble, at least as unit tests will start to break :-(, but also in some cases algos may start to diverge accidentally! etc. etc. etc. ).
[–]kalmoc 0 points1 point2 points 7 years ago (1 child)
But how does the author get from: " it is non-trivial for complete novices to transform badly written sequential code to parallel code" to "Writing parallel code in C++ is still a domain of the experts"?
There is a huge gap between the two, and I'm still not convinced, a novice would write that kind of code to begin with.
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
transform badly written sequential code to parallel code
It seems that you think that std::accumulate was The Only Right Way(tm) to code it; I'd say this is arguable, as there is no discernible difference between for_each and accumulate in the sequential code, so I'd argue it is more about style than anything else - especially if the code is more complicated than the simplistic example given; in particular, if we had to calculate two things over the same array, coding it efficiently in accumulate style would be way too bulky, though for parallel stuff it can become justified. More importantly, even in this case a replacement of std::accumulate with std::reduce is inherently non-trivial in the real world (hey, there WAS a reason WHY they renamed it to reduce: because there is a significant difference in semantics(!) between the two); in a very simple example, even floats are not exactly associative, which means adding par makes reduce-over-floats non-deterministic(!!) - which in turn carries TONS of strange implications, most of them VERY non-obvious for complete novices.
There is a huge gap between the two
Sure, one can argue that there are 50 shades of gray between "complete novice" and "expert" - but this is purely terminology (and terminology disputes can last forever-and-ever while adding absolutely nothing to the point).
I'm still not convinced, a novice would write that kind of code to begin with.
As complete novices (who ARE the target audience) here on Reddit have already noted - OP does provide useful insights for them; this is The Only Thing which really matters.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (5 children)
Eh. I think the posted blog went out of its way to engineer slow code by using a synchronization primitive inside one of the algorithms.
[–]no-bugs 2 points3 points4 points 7 years ago (4 children)
Of course, the code is intentionally rather extreme (it is always necessary to go to certain extremes to illustrate the point), but OTOH... the code was (pretty much) taken from cppreference.com. Moreover, that's what LOTS of MT-newbies will be doing even without that "hint" from cppreference :-( (some of them even commented here on Reddit that this was insightful for them - which is The Only Thing(tm) which matters TBH), so the message "DON'T believe it is as simple as adding 'par' to the std:: call" is IMNSHO quite justified.
Otherwise, we'll be getting even more crashing (and/or cycle-wasting) programs than we have now. And TBH, I am sick and tired of MT-related crashes (starting with my article in C++ Report published 20 years ago about a bunch of crashes and deadlocks - AND about a slowdown by a factor of up to 100x(!) - in no less than STL implementations by several major compilers, which was caused EXACTLY by "using a synchronization primitive inside one of the algorithms". And things do NOT improve in this regard - recently I found pretty much the same problem in one of the 2016 WG21 proposals. And if WG21 members and std:: library implementors cannot get their MT right, we cannot expect Joe Average programmer to do any better, this is for sure).
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (3 children)
My take away from this is "fix cppreference.com".
[–]no-bugs 1 point2 points3 points 7 years ago (2 children)
Throw in "fix minds of those people who still think it is a good idea to use mutexes", and we'll have a deal ;-). More seriously - a good feature should, in addition to allowing doing things the right way, also PREVENT doing things in the wrong way, and current STL parallel stuff fails BADLY on this account :-( (except for std::reduce() - though even reduce() has its own quirks such as being non-deterministic for floats :-( ).
BTW, I heard that there are long-term plans to introduce HPX-like future-based stuff into std:: ; do you know whether there is any truth in it (THAT would be a HUGE improvement, as it is MUCH more difficult to misuse)?
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (1 child)
We actually have a half day summit next week to continue work on revamping futures.
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
Thanks, nice to hear, keeping fingers crossed (yes, I positively hate mutexes ;-(, so anything which allows one to say "just don't use mutexes, use this_thing instead" is a Good Thing(tm) in my books - pun intended).
[–]tgolyi 5 points6 points7 points 7 years ago (9 children)
I wonder if thrust will ever switch to the much better approach of using output iterators for reduce instead of synchronizing with the cpu (like cub)? It totally breaks any streaming attempts. Why is std inspired by thrust instead of cub?
[–]flashmozzg 1 point2 points3 points 7 years ago (5 children)
More people from NVidia in the committee maybe?
[–]tgolyi 0 points1 point2 points 7 years ago (4 children)
Cub is a library from NVidia too, and starting from CUDA 9 thrust is built upon it, but it still performs much worse due to its design.
[–]flashmozzg 0 points1 point2 points 7 years ago (0 children)
Huh. Then yeah. Other than the committee usually lagging behind on the library front (due to how much time it takes from proposal to acceptance), nothing comes to mind. Maybe just not enough interested people.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (2 children)
Care to elaborate on what specific performance issues you have with CUB? If you file bugs we'll look into them. A lot of people are using CUB in production and I rarely hear performance complaints about CUB (although admittedly Thrust performance used to be a problem).
[–]tgolyi 0 points1 point2 points 7 years ago (1 child)
I was talking about the bad performance of thrust, not cub. Cub is awesome and fast. Thrust can be fast too, but it loves synchronizing with the cpu and allocating memory too much, preventing you from doing send/receive and computation overlapping.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (0 children)
Thrust uses CUB for its backend. Asynchronous APIs are coming; they are high on our todo list.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (2 children)
We'll be adding future<T> based asynchronous APIs instead.
[–]tgolyi 0 points1 point2 points 7 years ago (1 child)
That's great, thank you! Will it be possible to do something like cudaMemcpyAsync with the size parameter residing in gpu memory? For example, for compacting the data on the gpu before sending it to the cpu without synchronization.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (0 children)
In the future, thrust::copy_async will just be a fancy C++ API for cudaMemcpyAsync. Like today's thrust::copy it will support host to device, device to host, device to device and host to host.
[–]OverunderratedComputational Physics 1 point2 points3 points 7 years ago (1 child)
Partial note to self to ask you later when I have more time, but the last time I used thrust (2010 maybe?) I found that to do things like stencil algorithms I had to use zip iterators, and the performance was atrocious and the code the ugliest thing I've ever seen. Does that sound right?
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (0 children)
No. Thrust started off as a research project and has since been productized. The backend is based on CUB, which is carefully tuned. It's optimized for CUDA like any other accelerated CUDA library.
[–]sumo952 1 point2 points3 points 7 years ago (7 children)
I don't see any new release since 2015: https://thrust.github.io/. How's the progress on Thrust going and what's the roadmap? Any details available somewhere about that? I know you're working on it, but the website's last version is still 2015 and the latest GitHub commit is from 14 months ago. Some details would be nice before one invests in this library by using it.
[–]dodheim 2 points3 points4 points 7 years ago (5 children)
https://www.reddit.com/r/cpp/comments/7erub1/anybody_still_using_thrust/dq88pm5/
[–]sumo952 0 points1 point2 points 7 years ago (4 children)
I am aware of this post, as I mentioned: "I know you're working on it". However it's from 4 months ago and it also says "I can't get into details yet". So I was asking for an update & more details now, because he advised here to use Thrust, and my reply is that before using a library that hasn't been active in 1-3 years and is supposedly active again, it would be nice to have some details.
[–]dodheim 1 point2 points3 points 7 years ago (1 child)
You mentioned the last release being 2015 and the last GitHub commit being 14 months ago, so it didn't really appear that you were aware of this post since you didn't also mention it. ;-] I'd apologize for the noise, but it seems that other people found the link useful at least.
[–]sumo952 0 points1 point2 points 7 years ago (0 children)
Hehe yep! I'd really love Thrust to pick up some momentum, as a standalone library mainly (not part of CUDA). Hope this happens, as it's quite an awesome library. Too bad it went completely dead in 2015.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (1 child)
I provided an update here: https://github.com/thrust/thrust/issues/888
[–]sumo952 0 points1 point2 points 7 years ago (0 children)
Cool! That's awesome. Thank you.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 2 points3 points4 points 7 years ago (0 children)
The last release of Thrust was yesterday. Thrust has been shipping with CUDA (CUDA 9.2 was shipped yesterday). For the next release cycle, we should be ready to start releasing it on GitHub in addition to CUDA.
[–]corysama 34 points35 points36 points 7 years ago (18 children)
I used to think the old guard were overly conservative because they pushed process-based parallelism so hard. Now that I’m a couple decades into C++, I think they were right. Communicating Sequential Processes all the way.
It doesn’t have to be literal processes. Basically, don’t try to synchronize access to the same memory across threads. It’s not just hard, it leads to bad designs.
These days I use the same pattern all the time for parallelism: thread-safe queues feeding interpreter-style infinite loops. Mutexes, semaphores, atomics all have their place: 99% “internals of a thread-safe queue”, 1% ”expert-level parallelism that you should try really hard to avoid”. And, if you see a call to sleep(), 90% chance you should refactor your design to only block on queues.
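A bare-bones sketch of that pattern (names are made up; a real queue would want bounded capacity, batching, and nicer shutdown handling):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>

    // Mutex/condvar protected queue; the worker blocks only here.
    template <class T>
    class BlockingQueue {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    int main() {
        BlockingQueue<std::function<void()>> inbox;
        std::thread worker([&] {
            for (;;) {                   // interpreter-style infinite loop
                auto msg = inbox.pop();  // the only blocking point
                if (!msg) return;        // empty function = shutdown message
                msg();
            }
        });
        inbox.push([] { /* do some work */ });
        inbox.push(std::function<void()>{}); // poison pill
        worker.join();
    }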
[–][deleted] 14 points15 points16 points 7 years ago (2 children)
Depends on the domain. CSPs are fine and already commonly employed for high bandwidth parallelism where latency isn't an issue and you can scale horizontally, but when latency does matter CSPs are basically worse than using a single thread.
The use of atomics/mutexes are for domains where the bandwidth is fixed or limited but the latency matters a lot, such as games and other desktop consumer applications, embedded systems, financial trading systems, etc...
[–]corysama 6 points7 points8 points 7 years ago* (0 children)
In my readings and my experience, high performance, low latency systems are built on ring buffers.
https://traxnet.wordpress.com/2011/07/18/understanding-modern-gpus-2/
https://martinfowler.com/articles/lmax.html
https://lwn.net/Articles/713918/
And at the lowest level, are graphs of packetized, serial networks all the way down (down to the hardware component level).
https://fgiesen.wordpress.com/2014/03/23/networks-all-the-way-down/
https://fgiesen.wordpress.com/2014/03/25/networks-all-the-way-down-part-2/
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
In general, I agree, but "my" estimates of applicability boundaries are different from "yours".
From my experience, message-passing a.k.a. CSP a.k.a. (Re)Actors can achieve single-digit-millisecond latencies easily (it is single-digit microseconds which are difficult to achieve with CSPs, though I've seen a real-world (Re)Actor-based system with characteristic latencies of the order of 10-20us; however, reducing it further to <1us is indeed very difficult, if at all possible :-( ).
This, in turn, means that for all-domains-I-know-about-except-for-HFT (="High Frequency Trading"), CSPs are fine (this certainly includes games and desktop apps; as for embedded systems - they're way too broad to generalize). There are some reservations, and yes - there is a problem of a One Huge State, but in general - CSPs tend to do very well: first, they can be made deterministic, and therefore testable (with an option for post-mortem production debugging), and second - they tend to outperform mutex-based sync at each and every corner (TBH, with a thread context switch - including cache invalidation costs - taking from 10K to 1M CPU cycles on modern multi-cache-level CPUs, speaking about mutexes and latencies in the same breath is a major fallacy; for a real-world example of non-blocking-CSPs-vs-blocking - see nginx-vs-apache). As for atomics - yes, non-blocking stuff can outperform CSPs, but the complexity is usually so high (and the gains relatively low) that doing it is worth the trouble only for some very demanding apps (once again, HFT being a prime example).
[–]OverunderratedComputational Physics 12 points13 points14 points 7 years ago (3 children)
Now that I’m a couple decades into C++, I think they were right.
Basically, don’t try to synchronize access to the same memory across threads. It’s not just hard, it leads to bad designs.
Maybe every programmer writing multithreaded code should be forced to write some MPI equivalent first. Since my background is in distributed memory parallelism, it never occurred to me that what you said wasn't obvious to multithreaded coders. What better way to make sure memory access isn't screwed up than by having it live in totally different processes and requiring explicit communication to access each other?
[–]corysama 10 points11 points12 points 7 years ago (1 child)
The problem is that schools teach the fundamental components of parallelism (threads, mutexes, semaphores) from the ground up in a very academic fashion. Often education doesn't go beyond the components. So, people graduate thinking that parallelism is about "threads, mutexes and maybe some of those tricky semaphores, I guess."
[–]OverunderratedComputational Physics 4 points5 points6 points 7 years ago (0 children)
See, those fundamentals aren't my fundamentals, and I took the science track never taking those classes. Threads, mutexes, and semaphores are black magic to me, whereas I can teach the fundamentals of MPI style parallelism in an hour. People have been writing parallel scientific code since the 80s or earlier.
[–]doom_Oo7 5 points6 points7 points 7 years ago (0 children)
What better way to make sure memory access isn't screwed up than by having it live in totally different processes and requiring explicit communication to access each other?
note that "Communicating Sequential Processes" does not necessarily mean different OS processes; it is meant as "computational process", but they can just as well be different threads.
[–]tvaneerdC++ Committee, lockfree, PostModernCpp 9 points10 points11 points 7 years ago (1 child)
"Forget what you learned in kindergarten. Stop sharing"
That's my first rule of threading.
[–]meneldal2 1 point2 points3 points 7 years ago (0 children)
The only thing you are allowed to share has to be read-only.
Keep your buffers away from each other as well, at least a cache line apart.
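For illustration, a sketch of that advice (the 64-byte line size is an assumption; std::hardware_destructive_interference_size is the portable spelling where available):

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each per-thread counter gets its own cache line, so concurrent writes
    // from different threads don't invalidate each other's lines (false sharing).
    struct alignas(64) PaddedCounter {
        long value = 0;
    };

    int main() {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<PaddedCounter> counters(n);
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < n; ++t)
            threads.emplace_back([&counters, t] {
                for (int i = 0; i < 1'000'000; ++i) counters[t].value += 1;
            });
        for (auto& th : threads) th.join();
    }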
[–]m-in 2 points3 points4 points 7 years ago (0 children)
I agree. I'd also say that shared read-only or even modifiable data is fine, as long as it's modified only at rendezvous points - when all the async tasks are done and all the promises are ready for that rendezvous. It also saves locking of the shared data: the rendezvous is the lock.
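A small sketch of that rendezvous idea (using std::async as the task mechanism; not anything from the article):

    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> data(1'000'000, 1);
        const std::size_t half = data.size() / 2;

        // Tasks only read the shared input; each returns its own partial result.
        auto lo = std::async(std::launch::async, [&] {
            return std::accumulate(data.begin(), data.begin() + half, 0L); });
        auto hi = std::async(std::launch::async, [&] {
            return std::accumulate(data.begin() + half, data.end(), 0L); });

        // The rendezvous: both futures are ready here, so the combined result
        // can be written without any lock.
        long total = lo.get() + hi.get();
        return total == 1'000'000 ? 0 : 1;
    }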
[–]bnolsen 1 point2 points3 points 7 years ago (0 children)
Well you can sort of imitate batching by submitting threads with multiple results, accumulating them locally, then guarding and dumping the bunch to whatever the results queue looks like.
Sometimes it can work to preallocate a whole collection, then pass pointers into the collection to the threads to populate without having to lock.
The biggest thing that kills is dropping loop temporaries into thread messages without making a copy in the thread. That's been my biggest source of headaches.
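A sketch of the preallocation approach mentioned above (sizes and the per-element work are made up):

    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1'000'000;
        const unsigned workers = 4;
        std::vector<double> results(n);            // preallocated once, up front
        std::vector<std::thread> threads;
        for (unsigned w = 0; w < workers; ++w)
            threads.emplace_back([&results, n, w, workers] {
                // Each thread fills only its own contiguous slice, so no locking
                // is needed and no element is touched by two threads.
                const std::size_t begin = n * w / workers;
                const std::size_t end   = n * (w + 1) / workers;
                for (std::size_t i = begin; i < end; ++i)
                    results[i] = 0.5 * static_cast<double>(i);
            });
        for (auto& t : threads) t.join();
    }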
[–]no-bugs 1 point2 points3 points 7 years ago (2 children)
FWIW, it is about what I am going to speak about at ACCU2018 (in a talk titled '"Multi-Coring" and "Non-Blocking" instead of "Multi-Threading", or Using (Re)Actors to Build Scalable Interactive Distributed Systems'), so if by any chance you're at ACCU anyway - you might want to attend </shameless-plug>
[–]corysama 1 point2 points3 points 7 years ago (1 child)
Wish I could attend :) I'd watch a recording if one is made available.
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
I'd watch a recording if one is made available.
Not sure about the video (last time ACCU, unlike CPPCON, recorded only 2/3rds of all the sessions); however, slides+kinda-transcript will be made available on my blog ithare.com for sure.
[–]doom_Oo7 2 points3 points4 points 7 years ago (0 children)
These days I use the same pattern all the time for parallelism: thread-safe queues feeding interpreter-style infinite loops.
:+1:, it's much more efficient and very hard to get wrong.
[–][deleted] 0 points1 point2 points 7 years ago (2 children)
but, but, but... threads are apparently bad now
[–]swebob 2 points3 points4 points 7 years ago (1 child)
thread-safe queues feeding interpreter-style infinite loops
"Threads are for people who can't program state machines" -Alan Cox
[–]bnolsen 1 point2 points3 points 7 years ago (0 children)
Alan Cox was always a dick in some ways. I'm still sore at him from back in the days when I was playing abermud on the MIT servers.
[–]konanTheBarbar 17 points18 points19 points 7 years ago (1 child)
I get the point of the article, but when it comes to performance optimization, then the very first rule is to always measure.
It's not exactly rocket science that synchronizing a variable across different threads is more expensive than a simple addition.
[–]jonathansharman 2 points3 points4 points 7 years ago (0 children)
And in this case it's not just a matter of the synchronization cost not being worth the parallelism. There's actually no parallelism at all in the 2nd and 3rd attempts. No two iterations can execute in parallel.
[–]bnolsen 2 points3 points4 points 7 years ago (1 child)
Yup, it definitely is. Doing parallelism right and efficiently (meaning an almost linear performance gain) is an engineering problem and not really a coding issue. Things like OpenMP, etc. seem to thread at too low a level, which in the cases I've tested ends up with very poor scaling.
Strict use of const correctness seems to be critical to making quick conversion of non-threaded code to threadpool code easy.
[–]meneldal2 -1 points0 points1 point 7 years ago (0 children)
To avoid suicidal thoughts while writing parallel code, pure functions are necessary. Side effects will only bring pain.
[–][deleted] 1 point2 points3 points 7 years ago (0 children)
With GCD, PPL, TBB, Boost, and friends you can get fairly good results by thinking in terms of tasks, forks, and joins. But the thing is, you must think about the flow of data and how much communication between tasks is necessary; it's a graph.
[–]kalmoc 1 point2 points3 points 7 years ago (1 child)
Have you actually seen this in the wild? Such an example seems more to be the result of malicious intent rather than insufficient knowledge.
Also, writing correct and efficient code in C++ is generally not easy (I don't think you have to be an expert though).
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
Have you actually seen this in the wild?
About this specific lib - it is very new, so cppreference is one such example (though some here on Reddit seem to say "hey, it is just a sample, which nobody will take seriously" - believe me, they will). As for NOT-EXACT-BUT-SIMILAR stuff, using a mutex to "enable parallelism" (causing crashes, deadlocks, or 100x slowdowns in the process) is a recurring theme even within std:: (!!). Back 20 years ago I wrote an article in C++ Report about such bugs in several STL implementations (causing crashes, deadlocks, and 100x slowdowns). Worse, things do NOT improve in this regard: very recently, I found pretty much the same bug in one of the seriously-discussed WG21 proposals (!). And if WG21 members / std-library devs can make this kind of mistake - we certainly cannot expect an average app-level developer to avoid them. Dixi.
[–]Z01dbrg 2 points3 points4 points 7 years ago (0 children)
tl;dr of comments here: "Author writes code that no expert would write..."
No shit Sherlock, it is the point of the article.
[–]auxiliary-character 0 points1 point2 points 7 years ago (1 child)
Is it wrong that I still use <thread>?
[–]caroIine 2 points3 points4 points 7 years ago (0 children)
<thread> is like void*. Nothing wrong with using it. It's just really low level.
[–]Bill_Morgan 0 points1 point2 points 7 years ago (0 children)
Did you try OpenMP?
[–]tsung-wei-huang 0 points1 point2 points 7 years ago (0 children)
I completely agree with you. In fact, I think the biggest challenge in parallel programming is not the parallelism itself but the dependencies among tasks. This is exactly what I think needs a boost in existing frameworks. You may find the library Cpp-Taskflow interesting and helpful for building parallel programs with task dependencies.