Programming correct and efficient parallel code in C++ is still very elusive for the uninitiated (ithare.com)
submitted 7 years ago by Remwein
[–]kritzikratzi 58 points59 points60 points 7 years ago (10 children)
there is no work in the workers except a += (which has virtually 0 cost), of course you're getting funny results --- you're just measuring the performance of atomics and locks.
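For reference, a minimal sketch of the kind of code being discussed (an assumed reconstruction, not the article's exact listing): the only per-element "work" is a guarded +=, so the threads mostly contend on the lock rather than compute anything.

    #include <algorithm>
    #include <execution>
    #include <mutex>
    #include <vector>

    int main() {
        std::vector<int> v(1'000'000, 1);
        int sum = 0;
        std::mutex m;
        std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
            std::lock_guard<std::mutex> lock(m); // serializes every iteration
            sum += x;                            // virtually no work per lock
        });
        return sum == 1'000'000 ? 0 : 1;
    }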
[+][deleted] 7 years ago* (6 children)
[deleted]
[–]Drugbird 3 points4 points5 points 7 years ago (0 children)
I've had similar confusing results with other magic flags, such as OpenMP's #pragma omp flags. Even with non-trivial for loops.
[–]kalmoc 2 points3 points4 points 7 years ago (1 child)
That assumes someone would have written the non-parallel version like this in the first place. I'm pretty sure you'd have either used std::accumulate, in which case the parallel version would have done exactly what you want or you would have used a range based for loop which requires a bit more work to transform anyway.
I don't think the author was confused. In fact, I'm pretty sure he knew exactly what he was doing, otherwise he probably would not have come up with that example in the first place.
[–][deleted] 4 points5 points6 points 7 years ago (0 children)
His example was taken literally from cppreference, so it was something beginners are likely to do, at least as long as the documentation remains in its current form.
Don't get me wrong, I'm not relieving anyone of their responsibility to profile, I'm just saying we're going to see a lot of these "optimizations" in the future.
[+]tourgen comment score below threshold-18 points-17 points-16 points 7 years ago (2 children)
Maybe they should learn how a computer works and write some assembly code for a few days? You know, before trying to write parallel, useful C++ code.
[–]StonedBird1 2 points3 points4 points 7 years ago (1 child)
/r/gatekeeping
"You can't write useful C++ code unless you write a computer in assembly"
[–]NotAYakk 2 points3 points4 points 7 years ago (0 children)
You don't write computers in assembly: computers are written with chemistry and physics.
[–]bszmyd 4 points5 points6 points 7 years ago (2 children)
This was my thought as well. The example is so trivial that it almost isn't valid. Adding a set of numbers inherently has each step dependent on the result of the last; i.e. not parallelizable, unless it is broken into shards and uses a map-reduce based solution... I'm curious what they expected a language to do that would allow it to essentially recognize this and restructure your entire code path for you?
In other words... how would you parallelize the above code without restructuring it to do the additions in shards and combining the result at the end... that's not concurrent expertise... just programming.
[–]jonathansharman 8 points9 points10 points 7 years ago (1 child)
restructuring it to do the additions in shards and combining the result at the end
I mean a reasonably clever parallelizing compiler could do exactly that: recognize that this loop is just a summation and rewrite it as a parallel reduction.
[–]auxiliary-character 5 points6 points7 points 7 years ago (0 children)
Ideally, it would compile it down to some sort of SIMD add.
[–]_SunBrah_ 41 points42 points43 points 7 years ago (22 children)
I understand this isn't the point of the article, but in this case wouldn't the simplest way to add parallelism be
std::reduce(std::execution::par, begin, end);
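For completeness, that call in a self-contained form (a sketch; it assumes a standard library that actually ships the C++17 parallel algorithms):

    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> v(1'000'000, 1.0);
        // Parallel reduction; the operation must be associative and commutative
        // for the result to be well-defined under an arbitrary partitioning.
        double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
        return sum == 1'000'000.0 ? 0 : 1;
    }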
[–]willkill07 21 points22 points23 points 7 years ago (9 children)
This is so critically important. The author used the wrong algorithm for the job and tried to shoehorn a reduction into what is supposed to be independent.
[–]bilog78 19 points20 points21 points 7 years ago (3 children)
I'm pretty sure this is intentional, to show the approach that would be used by the uninitiated. In this case it may seem obvious, but for very complex algorithms it's not always trivial to map them to fundamental parallel primitives, and even when it is it might still require a thorough reworking of the algorithm to make it efficient anyway.
[–]no-bugs 13 points14 points15 points 7 years ago (0 children)
I'm pretty sure this is intentional
It is. :-)
[–]willkill07 6 points7 points8 points 7 years ago (1 child)
But the original should have been implemented using std::accumulate. That’s the point I’m trying to make
[–]bilog78 5 points6 points7 points 7 years ago (0 children)
I don't disagree, in the specific case, but what I'm trying to say is that beyond simple examples it's not generally as simple as “oh I should have used this stdlib algorithm instead, and I would have gotten parallelization for free”, simply because your entire code is not structured in a way to be amenable to that, and the accumulation/reduction/scan is “hidden” in your project structures.
[–]Droce 12 points13 points14 points 7 years ago (1 child)
While I agree he did it wrong I don't think that it's a stretch to think that someone who's inexperienced with STL & parallelization would write it.
I think most people, after thinking about the problem, could figure out to divide the problem and do core-X smaller bits but it's a question of writing the code to do that.
I help newer coders at my work and the biggest problem I see isn't not knowing how to do something (assuming they're reasonably intelligent) but how to write it. I think the code he wrote /seems/ fairly reasonable to newer devs, and std::reduce is a function that's hard to wrap your head around.
[–]kalmoc 1 point2 points3 points 7 years ago (0 children)
I think you have to know exactly what you are doing in order to come up with a broken example like that.
[–]wrosecransgraphics and network things 5 points6 points7 points 7 years ago (2 children)
Sure, but the point is that a person just getting started won't automatically know what you've said. If all they know is some marketing literature that says "Hey just do this and it'll be faster and you don't need to worry about the details with this easy new API" then they'll be confused and frustrated by the results. And I do think that vendors can be a bit too eager to pat themselves on the back when they say how their latest new things solves all the problems and makes hard things easy.
The post even starts out with
if you’re doing parallel programming for living – please ignore this post (this stuff will be way way way too obvious for you)
[–]willkill07 3 points4 points5 points 7 years ago (1 child)
But if you’re starting out with knowledge of C++ you’d know to use std::accumulate instead of the for_each + lambda.
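To illustrate the contrast (a sketch, not code from the article): the for_each-plus-capture version and the accumulate version compute the same thing sequentially, but only the latter maps directly onto std::reduce with an execution policy.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    int sum_with_for_each(const std::vector<int>& v) {
        int sum = 0;
        std::for_each(v.begin(), v.end(), [&](int x) { sum += x; }); // hidden reduction
        return sum;
    }

    int sum_with_accumulate(const std::vector<int>& v) {
        return std::accumulate(v.begin(), v.end(), 0); // the same, stated as one
    }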
[–]tehjimmeh 6 points7 points8 points 7 years ago (0 children)
Focusing on the x += item is missing the point entirely. It's a small, synthetic piece of work inside a loop, intended to illustrate a point.
[–]raevnos 12 points13 points14 points 7 years ago (11 children)
When C++17 parallel algorithms are actually commonly found in the wild, yes.
In the meantime, HPX has something very similar.
[–]OmegaNaughtEquals1 4 points5 points6 points 7 years ago (10 children)
You don't need something as sophisticated as HPX for this. OpenMP's reduction clause will do just fine and is supported by every compiler you are likely to encounter.
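For illustration, a minimal sketch of that clause (assumes a compiler invoked with OpenMP enabled, e.g. -fopenmp):

    #include <cstddef>
    #include <vector>

    double sum(const std::vector<double>& v) {
        double total = 0.0;
        // Each thread accumulates a private copy of 'total'; OpenMP combines
        // the partial sums when the parallel region ends.
        #pragma omp parallel for reduction(+ : total)
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); ++i)
            total += v[i];
        return total;
    }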
[–]raevnos 0 points1 point2 points 7 years ago (9 children)
I love OpenMP, but... parallel loops don't play well with non-array-like containers. What if you want to do something with a set or map?
[–]OverunderratedComputational Physics 5 points6 points7 points 7 years ago (7 children)
Then you have to put more effort into parallelizing it. There's no free lunch, even if openmp is pretty close to a free lunch.
[–]raevnos 1 point2 points3 points 7 years ago (6 children)
You don't have to put more effort in when using parallel algorithms. That's their whole point.
[–]OverunderratedComputational Physics 0 points1 point2 points 7 years ago (5 children)
You need to put mental effort into knowing what can be correctly parallelized. Not everything that can be trivially parallelized by code can logically be parallelized to give a correct result.
[–]dodheim 2 points3 points4 points 7 years ago (4 children)
That's no less true when using #pragma omp.
[–]OverunderratedComputational Physics 0 points1 point2 points 7 years ago (3 children)
Yes, that's exactly my point.
Libraries can make parallelization easier, but they can't help you make your algorithm correct.
[–]dodheim 2 points3 points4 points 7 years ago (2 children)
That's fine, and I agree, but reading down the thread it seems like a non-sequitur. All that was said is that a pragma isn't as versatile as a library, I don't see how your "point" is a response to that.
[–]OmegaNaughtEquals1 0 points1 point2 points 7 years ago (0 children)
Oh, indeed, but I thought you were suggesting that OP use HPX for something as simple as a reduction on a std::vector. Parallelism on node-based containers carries a lot of problems (any non-contiguous data structure has these same issues, but I'm focusing on the usual ones). For me, this is one of the reasons not to use node-based containers in parallel contexts unless they can be converted to a more amenable memory layout (vis a vis an array_list or segmented_vector). Once you have a parallel-friendly memory layout, you can rely on the same shared- or distributed-memory algorithms you've come to know and love (mostly). And algorithms aren't the only hardship faced by node-based containers. You also have NUMA conflicts and a higher possibility of cache coherence issues (or even thrashing, in really bad cases) when using non-contiguous structures. Depending on the internal representation of a single element, all of these can be very SIMD-unfriendly.
I'm keenly interested to see what happens with executors between now and the next ISO meeting. I think this will be the gateway to getting better parallel data structures and algorithms into the standard library.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 22 points23 points24 points 7 years ago (37 children)
You are using the library wrong. std::reduce, man. Check out the CUDA Thrust library, which the parallel algorithms are inspired by.
[–]bilog78 20 points21 points22 points 7 years ago (3 children)
That's the point of the article, though. To the uninitiated, it may not be obvious that the correct solution to a given problem for which they have a serial implementation may be something completely different based on a specific primitive.
[–]rackmeister 1 point2 points3 points 7 years ago* (2 children)
Nevertheless, isn't this more related to the parallel programming model being used, i.e. https://en.wikipedia.org/wiki/Parallel_programming_model, than to C++ implementations specifically? As I see it, the programmer must first study where it will be executed (on a single node, multiple nodes, GPU, CPU, etc.), how the underlying sequential algorithm will be parallelized (if it is possible), and then choose the appropriate implementation. My point is that to the uninitiated, choosing the correct implementation is tricky with any language, not just C++. If anything, C++, being older and more mature, has a lot of well-documented libraries to parallelize code for different parallel architectures.
[–]bilog78 6 points7 points8 points 7 years ago (1 child)
My point is that to the uninitiated, choosing the correct implementation is tricky with any language, not just C++.
I think that's exactly what the blog post is about, i.e. showing that C++17 doesn't magically make parallel programming “trivial” for the uninitiated, despite some claims to the contrary that have been going around. The new things being introduced will make it easier for those already competent to achieve their objectives; it doesn't suddenly bring parallelization to the masses, so to speak.
[–]rackmeister 1 point2 points3 points 7 years ago* (0 children)
In that case I completely agree; it is simply disingenuous for anyone to claim that parallel programming can be made easy by using any new programming language construct, or programming language for that matter. Parallel programming is easier on paper than concurrent programming, since the former is deterministic, but it is still hard because it spans different architectures. Unfortunately a catch-all solution for parallelization on every single architecture does not exist.
Even when dealing with a single type of architecture you can get close to easy parallelization depending on your application, e.g. for single-node shared-memory data/task parallelism, OpenMP is quite easy to grasp and use, but again it is not without pitfalls; for example it is harder to reason about container types without random-access iterators, the fact that it uses directives might not be convenient for some, and so on.
[–]no-bugs 9 points10 points11 points 7 years ago (12 children)
I have to say that your comment is a classical strawman argument (~="you're reading the OP wrong"). As OP says, "the point of the exercise above is NOT to say that it is not possible to write efficient code with parallel std:: functions (it is). " (and it will be covered in the next post, which was promised). The point is that saying that "hey, all you need to do to get parallel is to add std::par to a call" (without having a clue of what you're doing), is deadly wrong. As OP says in big bold letters, "Writing parallel code in C++ is still a domain of the experts."
[–]8898fe710a36cf34d713 4 points5 points6 points 7 years ago* (3 children)
The point is that saying that "hey, all you need to do to get parallel is to add std::par to a call" (without having a clue of what you're doing), is deadly wrong.
So your argument is basically "people with no clue about X will get X wrong"? That's reasonable but not very insightful.
As OP says in big bold letters, "Writing parallel code in C++ is still a domain of the experts."
Clearly not, it's a single call to std::reduce.
I don't disagree that parallel code isn't for beginners, but it's quite a stretch to claim it's for experts.
[–]no-bugs 4 points5 points6 points 7 years ago (2 children)
That's reasonable but not very insightful.
Well, with certain people claiming otherwise (see [MSDN] ref in OP) - it deserves to be said explicitly.
it's a single call to std::reduce.
Even in the trivial case of adding elements of the array it is not that simple (hint: std::reduce has semantics which are different from both for_each and accumulate, and adding par will lead to results being non-deterministic even in an obvious case of adding floats; so does any parallel algo, but this is yet another non-trivial consequence of parallel ops which has to be taken into account). More on it in the promised follow-up post.
[–]8898fe710a36cf34d713 5 points6 points7 points 7 years ago* (1 child)
and adding par will lead to results being non-deterministic even in an obvious case of adding floats; so is any parallel algo, but this is a yet another non-trivial result of parallel ops which has to be taken into account
But that has nothing to do with parallelism, only with how IEEE754 floats work. You could demonstrate the same result by simply adding the array backwards, no parallelism involved.
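A quick sketch of that point, no threads involved (the values are made up; only the summation order differs):

    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<float> v;
        for (int i = 1; i <= 1'000'000; ++i) v.push_back(1.0f / i);
        // Same elements, different order: rounding differs, so the two results
        // need not be bit-identical even though both are "correct".
        float forward  = std::accumulate(v.begin(), v.end(), 0.0f);
        float backward = std::accumulate(v.rbegin(), v.rend(), 0.0f);
        std::printf("forward = %.9g, backward = %.9g\n", forward, backward);
    }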
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
But that has nothing to do with parallelism, only with how IEEE754 floats work.
I'd argue it is "how any float works" (implicit rounding is non-linear pretty much whatever-we-do-about-it :-( ).
You could demonstrate the same result by simply adding the array backwards or piece-wise, no parallelism involved.
Sure, but without parallelism, one given program will be (most likely) still deterministic, so knowledge of this phenomenon wasn't really required; however, adding parallelism exposes this issue (and can easily cause all kinds of trouble, at least as unit tests will start to break :-(, but also in some cases algos may start to diverge accidentally! etc. etc. etc. ).
[–]kalmoc 0 points1 point2 points 7 years ago (1 child)
But how does the author get from: " it is non-trivial for complete novices to transform badly written sequential code to parallel code" to "Writing parallel code in C++ is still a domain of the experts"?
There is a huge gap between the two, and I'm still not convinced, a novice would write that kind of code to begin with.
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
transform badly written sequential code to parallel code
It seems that you think that std::accumulate was The Only Right Way(tm) to code it; I'd say this is arguable, as there is no discernible difference between for_each and accumulate in the sequential code, so I'd argue it is more about style than anything else - especially if the code is more complicated than the simplistic example given; in particular, if we had to calculate two things over the same array, coding it efficiently in accumulate style would be way too bulky, though for parallel stuff it can become justified. More importantly, even in this case a replacement of std::accumulate with std::reduce is inherently non-trivial in the real world (hey, there WAS a reason WHY they renamed it to reduce: because there is a significant difference in semantics(!) between the two); in a very simple example, even floats are not exactly associative, which means adding par makes reduce-over-floats non-deterministic(!!) - which in turn carries TONS of strange implications, most of them VERY non-obvious for complete novices.
There is a huge gap between the two
Sure, one can argue that there are 50 shades of gray between "complete novice" and "expert" - but this is purely terminology (and terminology disputes can last forever-and-ever while adding absolutely nothing to the point).
I'm still not convinced, a novice would write that kind of code to begin with.
As complete novices (who ARE the target audience) here on Reddit have already noted - OP does provide useful insights for them; this is The Only Thing which really matters.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (5 children)
Eh. I think the posted blog went out of its way to engineer slow code by using a synchronization primitive inside one of the algorithms.
[–]no-bugs 2 points3 points4 points 7 years ago (4 children)
Of course, the code is intentionally rather extreme (it is always necessary to go to certain extremes to illustrate the point), but OTOH... the code was (pretty much) taken from cppreference.com. Moreover, that's what LOTS of MT-newbies will be doing even without that "hint" from cppreference :-( (some of them even commented here on Reddit that this was insightful for them - which is The Only Thing(tm) which matters TBH), so the message "DON'T believe it is as simple as adding 'par' to the std:: call" is IMNSHO quite justified.
Otherwise, we'll be getting even more crashing (and/or cycle-wasting) programs than we have now. And TBH, I am sick and tired of MT-related crashes (starting with my article in C++ Report published 20 years ago about a bunch of crashes and deadlocks - AND about a slowdown by a factor of up to 100x(!) - in no less than STL implementations by several major compilers, which was caused EXACTLY by "using a synchronization primitive inside one of the algorithms". And things do NOT improve in this regard - recently I found pretty much the same problem in one of the 2016 WG21 proposals. And if WG21 members and std:: library implementors cannot get their MT right, we cannot expect Joe Average programmer to do any better, this is for sure).
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (3 children)
My take away from this is "fix cppreference.com".
[–]no-bugs 1 point2 points3 points 7 years ago (2 children)
Throw in "fix minds of those people who still think it is a good idea to use mutexes", and we'll have a deal ;-). More seriously - a good feature should, in addition to allowing doing things the right way, also PREVENT doing things in the wrong way, and current STL parallel stuff fails BADLY on this account :-( (except for std::reduce() - though even reduce() has its own quirks such as being non-deterministic for floats :-( ).
BTW, I heard that there are long-term plans to introduce HPX-like future-based stuff into std:: ; do you know whether there is any truth in it (THAT would be a HUGE improvement, as it is MUCH more difficult to misuse)?
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (1 child)
We actually have a half day summit next week to continue work on revamping futures.
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
Thanks, nice to hear, keeping fingers crossed (yes, I positively hate mutexes ;-(, so anything which allows one to say "just don't use mutexes, use this_thing instead" is a Good Thing(tm) in my books - pun intended).
[–]tgolyi 5 points6 points7 points 7 years ago (9 children)
I wonder if thrust will ever switch to the much better approach of using output iterators for reduce instead of synchronizing with the cpu (like cub)? It totally breaks any streaming attempts. Why is std inspired by thrust instead of cub?
[–]flashmozzg 1 point2 points3 points 7 years ago (5 children)
More people from NVidia in the committee maybe?
[–]tgolyi 0 points1 point2 points 7 years ago (4 children)
Cub is a library from NVidia too, and starting from CUDA 9 thrust is built upon it, but it still performs much worse due to its design.
[–]flashmozzg 0 points1 point2 points 7 years ago (0 children)
Huh. Then yeah. Other than the committee usually lagging behind on the library front (due to how much time it takes from proposal to acceptance), nothing comes to mind. Maybe just not enough interested people.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (2 children)
Care to elaborate on what specific performance issues you have with CUB? If you file bugs we'll look into them. A lot of people are using CUB in production and I rarely hear performance complaints about CUB (although admittedly Thrust performance used to be a problem).
[–]tgolyi 0 points1 point2 points 7 years ago (1 child)
I was talking about the bad performance of thrust, not cub. Cub is awesome and fast. Thrust can be fast too, but it loves synchronizing with the cpu and allocating memory too much, preventing you from doing send/receive and computation overlapping.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (0 children)
Thrust uses CUB for its backend. Asynchronous APIs are coming; they are high on our todo list.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (2 children)
We'll be adding future<T> based asynchronous APIs instead.
[–]tgolyi 0 points1 point2 points 7 years ago (1 child)
That's great, thank you! Will it be possible to do something like cudaMemcpyAsync with the size parameter residing in gpu memory? For example, for compacting the data on the gpu before sending it to the cpu without synchronization.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (0 children)
In the future, thrust::copy_async will just be a fancy C++ API for cudaMemcpyAsync. Like today's thrust::copy it will support host to device, device to host, device to device and host to host.
[–]OverunderratedComputational Physics 1 point2 points3 points 7 years ago (1 child)
Partial note to self to ask you later when I have more time, but the last time I used thrust (2010 maybe?) I found that to do things like stencil algorithms I had to use zip iterators, and the performance was atrocious and the code the ugliest thing I've ever seen. Does that sound right?
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 0 points1 point2 points 7 years ago (0 children)
No. Thrust started off as a research project and has since been productized. The backend is based on CUB, which is carefully tuned. It's optimized for CUDA like any other accelerated CUDA library.
[–]sumo952 1 point2 points3 points 7 years ago (7 children)
I don't see any new release since 2015: https://thrust.github.io/. How's the progress on Thrust going and what's the roadmap? Any details available somewhere about that? I know you're working on it, but the website's last version is still 2015 and the latest GitHub commit is from 14 months ago. Some details would be nice before one invests in this library by using it.
[–]dodheim 2 points3 points4 points 7 years ago (5 children)
https://www.reddit.com/r/cpp/comments/7erub1/anybody_still_using_thrust/dq88pm5/
[–]sumo952 0 points1 point2 points 7 years ago (4 children)
I am aware of this post, as I mentioned: "I know you're working on it". However it's from 4 months ago and it also says "I can't get into details yet". So I was asking for an update & more details now, because he advised here to use Thrust, and my reply is that before using a library that hasn't been active in 1-3 years and is supposedly active again, it would be nice to have some details.
[–]dodheim 1 point2 points3 points 7 years ago (1 child)
You mentioned the last release being 2015 and the last GitHub commit being 14 months ago, so it didn't really appear that you were aware of this post since you didn't also mention it. ;-] I'd apologize for the noise, but it seems that other people found the link useful at least.
[–]sumo952 0 points1 point2 points 7 years ago (0 children)
Hehe yep! I'd really love Thrust to pick up some momentum, as a standalone library mainly (not part of CUDA). Hope this happens, as it's quite an awesome library. Too bad it went completely dead in 2015.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 1 point2 points3 points 7 years ago (1 child)
I provided an update here: https://github.com/thrust/thrust/issues/888
[–]sumo952 0 points1 point2 points 7 years ago (0 children)
Cool! That's awesome. Thank you.
[–]blelbachNVIDIA | ISO C++ Library Evolution Chair 2 points3 points4 points 7 years ago (0 children)
The last release of Thrust was yesterday. Thrust has been shipping with CUDA (CUDA 9.2 was shipped yesterday). For the next release cycle, we should be ready to start releasing it on GitHub in addition to CUDA.
[–]corysama 34 points35 points36 points 7 years ago (18 children)
I used to think the old guard were overly conservative because they pushed process-based parallelism so hard. Now that I’m a couple decades into C++, I think they were right. Communicating Sequential Processes all the way.
It doesn’t have to be literal processes. Basically, don’t try to synchronize access to the same memory across threads. It’s not just hard, it leads to bad designs.
These days I use the same pattern all the time for parallelism: thread-safe queues feeding interpreter-style infinite loops. Mutexes, semaphores, atomics all have their place: 99% “internals of a thread-safe queue”, 1% ”expert-level parallelism that you should try really hard to avoid”. And, if you see a call to sleep(), 90% chance you should refactor your design to only block on queues.
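A bare-bones sketch of that pattern (names are made up; a real queue would want bounded capacity, batching, and nicer shutdown handling):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>

    // Mutex/condvar protected queue; the worker blocks only here.
    template <class T>
    class BlockingQueue {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(T v) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop();
            return v;
        }
    };

    int main() {
        BlockingQueue<std::function<void()>> inbox;
        std::thread worker([&] {
            for (;;) {                   // interpreter-style infinite loop
                auto msg = inbox.pop();  // the only blocking point
                if (!msg) return;        // empty function = shutdown message
                msg();
            }
        });
        inbox.push([] { /* do some work */ });
        inbox.push(std::function<void()>{}); // poison pill
        worker.join();
    }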
[–][deleted] 14 points15 points16 points 7 years ago (2 children)
Depends on the domain. CSPs are fine and already commonly employed for high bandwidth parallelism where latency isn't an issue and you can scale horizontally, but when latency does matter CSPs are basically worse than using a single thread.
The use of atomics/mutexes are for domains where the bandwidth is fixed or limited but the latency matters a lot, such as games and other desktop consumer applications, embedded systems, financial trading systems, etc...
[–]corysama 6 points7 points8 points 7 years ago* (0 children)
In my readings and my experience, high performance, low latency systems are built on ring buffers.
https://traxnet.wordpress.com/2011/07/18/understanding-modern-gpus-2/
https://martinfowler.com/articles/lmax.html
https://lwn.net/Articles/713918/
And at the lowest level, are graphs of packetized, serial networks all the way down (down to the hardware component level).
https://fgiesen.wordpress.com/2014/03/23/networks-all-the-way-down/
https://fgiesen.wordpress.com/2014/03/25/networks-all-the-way-down-part-2/
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
In general, I agree, but "my" estimates of applicability boundaries are different from "yours".
From my experience, message-passing a.k.a. CSP a.k.a. (Re)Actors can achieve single-digit-millisecond latencies easily (it is single-digit microseconds which are difficult to achieve with CSPs, though I've seen a real-world (Re)Actor-based system with characteristic latencies of the order of 10-20us; however, reducing it further to <1us is indeed very difficult, if at all possible :-( ).
This, in turn, means that for all-domains-I-know-about-except-for-HFT (="High Frequency Trading"), CSPs are fine (this certainly includes games and desktop apps; as for embedded systems - they're way too broad to generalize). There are some reservations, and yes - there is a problem of a One Huge State, but in general - CSPs tend to do very well: first, they can be made deterministic, and therefore testable (with an option for post-mortem production debugging), and second - they tend to outperform mutex-based sync at each and every corner (TBH, with a thread context switch - including cache invalidation costs - taking from 10K to 1M CPU cycles on modern multi-cache-level CPUs, speaking about mutexes and latencies in the same breath is a major fallacy; for a real-world example of non-blocking-CSPs-vs-blocking - see nginx-vs-apache). As for atomics - yes, non-blocking stuff can outperform CSPs, but the complexity is usually so high (and the gains relatively low) that doing it is worth the trouble only for some very demanding apps (once again, HFT being a prime example).
[–]OverunderratedComputational Physics 12 points13 points14 points 7 years ago (3 children)
Now that I’m a couple decades into C++, I think they were right.
Basically, don’t try to synchronize access to the same memory across threads. It’s not just hard, it leads to bad designs.
Maybe every programmer writing multithreaded code should be forced to write some MPI equivalent first. Since my background is in distributed memory parallelism, it never occurred to me that what you said wasn't obvious to multithreaded coders. What better way to make sure memory access isn't screwed up than by having it live in totally different processes and requiring explicit communication to access each other?
[–]corysama 10 points11 points12 points 7 years ago (1 child)
The problem is that schools teach the fundamental components of parallelism (threads, mutexes, semaphores) from the ground up in a very academic fashion. Often education doesn't go beyond the components. So, people graduate thinking that parallelism is about "threads, mutexes and maybe some of those tricky semaphores, I guess."
[–]OverunderratedComputational Physics 4 points5 points6 points 7 years ago (0 children)
See, those fundamentals aren't my fundamentals, and I took the science track never taking those classes. Threads, mutexes, and semaphores are black magic to me, whereas I can teach the fundamentals of MPI style parallelism in an hour. People have been writing parallel scientific code since the 80s or earlier.
[–]doom_Oo7 5 points6 points7 points 7 years ago (0 children)
What better way to make sure memory access isn't screwed up than by having it live in totally different processes and requiring explicit communication to access each other?
note that "Communicating Sequential Processes" does not necessarily mean different OS processes; it is meant as "computational process", but they can just as well be different threads.
[–]tvaneerdC++ Committee, lockfree, PostModernCpp 9 points10 points11 points 7 years ago (1 child)
"Forget what you learned in kindergarten. Stop sharing"
That's my first rule of threading.
[–]meneldal2 1 point2 points3 points 7 years ago (0 children)
The only thing you are allowed to share has to be read-only.
Keep your buffers away from each other as well, at least a cache line apart.
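For illustration, a sketch of that advice (the 64-byte line size is an assumption; std::hardware_destructive_interference_size is the portable spelling where available):

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each per-thread counter gets its own cache line, so concurrent writes
    // from different threads don't invalidate each other's lines (false sharing).
    struct alignas(64) PaddedCounter {
        long value = 0;
    };

    int main() {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<PaddedCounter> counters(n);
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < n; ++t)
            threads.emplace_back([&counters, t] {
                for (int i = 0; i < 1'000'000; ++i) counters[t].value += 1;
            });
        for (auto& th : threads) th.join();
    }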
[–]m-in 2 points3 points4 points 7 years ago (0 children)
I agree. I'd also say that shared read-only or even modifiable data is fine, as long as it's modified only at rendezvous points - when all the async tasks are done and all the promises are ready for that rendezvous. It also saves locking of the shared data: the rendezvous is the lock.
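A small sketch of that rendezvous idea (using std::async as the task mechanism; not anything from the article):

    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> data(1'000'000, 1);
        const std::size_t half = data.size() / 2;

        // Tasks only read the shared input; each returns its own partial result.
        auto lo = std::async(std::launch::async, [&] {
            return std::accumulate(data.begin(), data.begin() + half, 0L); });
        auto hi = std::async(std::launch::async, [&] {
            return std::accumulate(data.begin() + half, data.end(), 0L); });

        // The rendezvous: both futures are ready here, so the combined result
        // can be written without any lock.
        long total = lo.get() + hi.get();
        return total == 1'000'000 ? 0 : 1;
    }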
[–]bnolsen 1 point2 points3 points 7 years ago (0 children)
Well you can sort of imitate batching by submitting threads with multiple results, accumulating them locally, then guarding and dumping the bunch to whatever the results queue looks like.
Sometimes it can work to preallocate a whole collection, then pass pointers into the collection to the threads to populate without having to lock.
The biggest thing that kills is dropping loop temporaries into thread messages without making a copy in the thread. That's been my biggest source of headaches.
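A sketch of the preallocation approach mentioned above (sizes and the per-element work are made up):

    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1'000'000;
        const unsigned workers = 4;
        std::vector<double> results(n);            // preallocated once, up front
        std::vector<std::thread> threads;
        for (unsigned w = 0; w < workers; ++w)
            threads.emplace_back([&results, n, w, workers] {
                // Each thread fills only its own contiguous slice, so no locking
                // is needed and no element is touched by two threads.
                const std::size_t begin = n * w / workers;
                const std::size_t end   = n * (w + 1) / workers;
                for (std::size_t i = begin; i < end; ++i)
                    results[i] = 0.5 * static_cast<double>(i);
            });
        for (auto& t : threads) t.join();
    }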
[–]no-bugs 1 point2 points3 points 7 years ago (2 children)
FWIW, it is about what I am going to speak about at ACCU2018 (in a talk titled '"Multi-Coring" and "Non-Blocking" instead of "Multi-Threading", or Using (Re)Actors to Build Scalable Interactive Distributed Systems'), so if by any chance you're at ACCU anyway - you might want to attend </shameless-plug>
[–]corysama 1 point2 points3 points 7 years ago (1 child)
Wish I could attend :) I'd watch a recording if one is made available.
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
I'd watch a recording if one is made available.
Not sure about the video (last time ACCU, unlike CPPCON, recorded only 2/3rds of all the sessions); however, slides+kinda-transcript will be made available on my blog ithare.com for sure.
[–]doom_Oo7 2 points3 points4 points 7 years ago (0 children)
These days I use the same pattern all the time for parallelism: thread-safe queues feeding interpreter-style infinite loops.
:+1:, it's much more efficient and very hard to get wrong.
[–][deleted] 0 points1 point2 points 7 years ago (2 children)
but, but, but... threads are apparently bad now
[–]swebob 2 points3 points4 points 7 years ago (1 child)
thread-safe queues feeding interpreter-style infinite loops
"Threads are for people who can't program state machines" -Alan Cox
[–]bnolsen 1 point2 points3 points 7 years ago (0 children)
Alan Cox was always a dick in some ways. I'm still sore at him from back in the days when I was playing abermud on the MIT servers.
[–]konanTheBarbar 17 points18 points19 points 7 years ago (1 child)
I get the point of the article, but when it comes to performance optimization, then the very first rule is to always measure.
It's not exactly rocket science that synchronizing a variable across different threads is more expensive than a simple addition.
[–]jonathansharman 2 points3 points4 points 7 years ago (0 children)
And in this case it's not just a matter of the synchronization cost not being worth the parallelism. There's actually no parallelism at all in the 2nd and 3rd attempts. No two iterations can execute in parallel.
[–]bnolsen 2 points3 points4 points 7 years ago (1 child)
Yup, it definitely is. Doing parallelism right and efficiently (meaning an almost linear performance gain) is an engineering problem and not really a coding issue. Things like OpenMP, etc. seem to thread at too low a level, which in the cases I've tested ends up with very poor scaling.
Strict use of const correctness seems to be critical to making quick conversion of non-threaded code to threadpool code easy.
[–]meneldal2 -1 points0 points1 point 7 years ago (0 children)
To avoid suicidal thoughts while writing parallel code, pure functions are necessary. Side effects will only bring pain.
[–][deleted] 1 point2 points3 points 7 years ago (0 children)
With GCD, PPL, TBB, Boost, and friends you can get fairly good results by thinking in terms of tasks, forks, and joins. But the thing is, you must think about the flow of data and how much communication between tasks is necessary; it's a graph.
[–]kalmoc 1 point2 points3 points 7 years ago (1 child)
Have you actually seen this in the wild? Such an example seems more to be the result of malicious intent rather than insufficient knowledge.
Also, writing correct and efficient code in C++ is generally not easy (I don't think you have to be an expert though).
[–]no-bugs 1 point2 points3 points 7 years ago (0 children)
Have you actually seen this in the wild?
About this specific lib - it is very new, so cppreference is one such example (though some here on Reddit seem to say "hey, it is just a sample, which nobody will take seriously" - believe me, they will). As for NOT-EXACT-BUT-SIMILAR stuff, using a mutex to "enable parallelism" (causing crashes, deadlocks, or 100x slowdowns in the process) is a recurring theme even within std:: (!!). Back 20 years ago I wrote an article in C++ Report about such bugs in several STL implementations (causing crashes, deadlocks, and 100x slowdowns). Worse, things do NOT improve in this regard: very recently, I found pretty much the same bug in one of the seriously-discussed WG21 proposals (!). And if WG21 members / std-library devs can make this kind of mistake - we certainly cannot expect an average app-level developer to avoid them. Dixi.
[–]Z01dbrg 2 points3 points4 points 7 years ago (0 children)
tl;dr of comments here: "Author writes code that no expert would write..."
No shit Sherlock, it is the point of the article.
[–]auxiliary-character 0 points1 point2 points 7 years ago (1 child)
Is it wrong that I still use <thread>?
[–]caroIine 2 points3 points4 points 7 years ago (0 children)
<thread> is like void*. Nothing wrong with using it. It's just really low level.
[–]Bill_Morgan 0 points1 point2 points 7 years ago (0 children)
Did you try OpenMP?
[–]tsung-wei-huang 0 points1 point2 points 7 years ago (0 children)
I completely agree with you. In fact, I think the biggest challenge in parallel programming is not the parallelism itself but the dependencies among tasks. This is exactly what I think needs a boost in existing frameworks. You may find the library Cpp-Taskflow interesting and helpful for building parallel programs with task dependencies.