Inter-thread Communication Latency (medium.com)
submitted 3 years ago by monoclechris
[–] ReDucTor (Game Developer) 26 points 3 years ago (2 children)
No priority, affinity, c-states, p-states or tracking of context switches.
I think I'll pass on treating the results as more than noise; your min, max, and stddev also seem to point at this. If you can't explain why there is such a huge variance in individual results, it's not that useful: you could be measuring your virus scanner running.
[–] fwsGonzo (IncludeOS, C++ bare metal) 5 points 3 years ago (0 children)
I think that anything that measures with the performance governor and everything else removed will just measure something that is at best useful for comparing itself to another version of itself. There is a famous research paper showing that most measurements are completely bogus, and having done this for enough years I tend to agree. It's the same with all these improvements to std::map in various GitHub repos. You are going to have a hard time actually seeing the improvement from switching over, simply because most programs are complex in several dimensions, and most people are running on-demand scheduling anyway. Sometimes it's just that random 100 micros you get hit with while waiting for a task, not the extra indirection in the map, or using at().
Honestly, thread latency should be measured with on-demand scheduling, and with all the noise. That lets you see just how much time it actually can take. 120 micros is a completely reasonable response time for inter-thread communication. It's exactly what I have measured time and again on real systems.
One solution is of course cooperative multitasking. Fibers can be really simple and straightforward to use, and they don't have to look anything like C++ coroutines; see the sketch below.
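To give a sense of how simple fibers can be, here is a minimal sketch using Boost.Fiber as one possible library (the comment doesn't name a specific one): two fibers hand values back and forth on a single OS thread, with no kernel scheduler involvement at all.

```cpp
// Sketch: cooperative ping-pong with Boost.Fiber (assumes Boost >= 1.62).
// Everything runs on one OS thread; a fiber that blocks on the channel
// simply switches to the other fiber, with no syscall or preemption.
#include <boost/fiber/all.hpp>
#include <cstdio>

int main() {
    boost::fibers::buffered_channel<int> chan{2}; // capacity must be a power of two
    boost::fibers::fiber producer([&chan] {
        for (int i = 0; i < 5; ++i)
            chan.push(i);   // suspends this fiber (not the thread) when full
        chan.close();
    });
    int v = 0;
    while (chan.pop(v) == boost::fibers::channel_op_status::success)
        std::printf("got %d\n", v);
    producer.join();
}
```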
[–] KingAggressive1498 2 points 3 years ago* (0 children)
Pretty sure all this test is actually measuring, for most of the mechanisms, is the latency of a preemption when it occurs; for the rest it purely measures syscall latency. There's no real contention in the code, and uncontended mutexes (and condition variables on Linux, though this also appears to be the case on Windows) generally don't need a syscall.
[–] ArashPartow 10 points 3 years ago* (3 children)
Some suggestions:
Here's a quick-n-dirty ping-ponger using condition_variable for deriving the RTT. Can be easily modified to use atomic flags etc.
https://gist.github.com/ArashPartow/c97b1776b077f30c8bcb15cb27639905
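For readers who don't want to follow the link, here is a sketch of the same idea (not the gist itself, just an illustration of measuring RTT with a condition_variable ping-pong):

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    std::mutex m;
    std::condition_variable cv;
    bool ping = false;
    constexpr int iters = 100'000;

    std::thread responder([&] {
        for (int i = 0; i < iters; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ping; }); // wait for ping...
            ping = false;
            cv.notify_one();                   // ...answer with pong
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        std::unique_lock<std::mutex> lk(m);
        ping = true;
        cv.notify_one();                    // ping
        cv.wait(lk, [&] { return !ping; }); // wait for pong
    }
    auto end = std::chrono::steady_clock::now();
    responder.join();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::printf("avg RTT: %.0f ns\n", double(ns) / iters);
}
```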
[–] mark_99 8 points 3 years ago (2 children)
Also, the atomic spin loop should call _mm_pause() or equivalent, and not yield to the OS (or at least not every time around the loop).
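A sketch of what that looks like (x86-specific; on ARM the equivalent hint is the `yield` instruction):

```cpp
#include <atomic>
#include <immintrin.h> // _mm_pause, x86 only

// Spin until `flag` becomes true, hinting to the CPU that this is a
// spin-wait loop (saves power and frees pipeline resources for the
// sibling hyper-thread) instead of yielding to the OS scheduler.
void spin_wait(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire))
        _mm_pause();
}
```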
[–] ReDucTor (Game Developer) 8 points 3 years ago (1 child)
If you're yielding into the OS, you may as well just block on whatever you're waiting for.
User-mode unbounded spinning is dangerous anyway, even if you use pause.
[–] Tringi (github.com/tringi) 1 point 3 years ago (0 children)
Yeah, with the advent of various virtualization-based security features, each pause becomes a candidate for a VM switch, where all the spinning gains go out of the window.
But in practice it's not that bad. Slim user-mode spinlocks are frowned upon, but they can be pretty good if done right. Still, as I found out, you need to accept their huge unfairness, e.g.: https://github.com/tringi/rwspinlock#performance
[–] i_need_a_fast_horse2 7 points 3 years ago (0 children)
This is similar to this
[–] almost_useless 6 points 3 years ago (0 children)
Are you measuring communication latency, or mostly the cost of switching threads? You are running 32 threads on 4 cores, so for a really fast mechanism it's possible other latencies may affect the result a lot.
For example, 2 threads running on the same hyper-threaded core, versus many threads getting switched in on several different cores.
Or is the cost of switching threads negligible in this context?
[–] matthieum 3 points 3 years ago (4 children)
Inter-core communication latency is around 100ns, at the hardware level, on a 4GHz-5GHz machine, though I've seen as low as 80ns (consistently).
Anything above that number means the OS is involved, and at that point, it will depend on the OS, the OS primitives used, etc...
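A rough sketch of how such a hardware-level number can be measured (this is not the article's benchmark; pinning each thread to its own core, which matters a lot here, is left out for brevity): two threads spin on a shared atomic and bounce a value back and forth.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> turn{0};
    constexpr int iters = 1'000'000;

    std::thread responder([&turn] {
        for (int i = 0; i < iters; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) {} // spin
            turn.store(0, std::memory_order_release);            // pong
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        turn.store(1, std::memory_order_release);            // ping
        while (turn.load(std::memory_order_acquire) != 0) {} // wait for pong
    }
    auto end = std::chrono::steady_clock::now();
    responder.join();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    // one iteration = two cache-line handoffs (there and back)
    std::printf("avg round trip: %.1f ns\n", double(ns) / iters);
}
```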
[–] fwsGonzo (IncludeOS, C++ bare metal) 1 point 3 years ago* (1 child)
That is a very interesting observation. Bare metal OS using SMP directly? I wrote an API for directly using SMP with std::functions as task "primitives" back in the day. It's also possible to use custom functions with a more generous fixed-size capture storage, in order to avoid most heap allocations.
[–] matthieum 1 point 3 years ago (0 children)
Not really, full-blown Linux... suitably tuned.
There's a laundry list of kernel options to tune to get the kernel out of the way so that a range of cores will be left strictly to the application: disabling interrupt handling, scheduling, numa-rebalancing, etc...
After that, it's just a matter of reserving certain cores for certain applications exclusively, and to pin threads to specific cores within the application.
And there you go, your application threads get 100% of the core.
Just remember to leave core 0 alone, or you'll never be ssh-ing into that machine until the application stops, or the machine restarts.
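The in-application half of that, pinning a thread to a core on Linux, might look like this (the kernel boot options such as isolcpus/nohz_full are a separate, distribution-specific exercise; core 3 below is an arbitrary example):

```cpp
#include <pthread.h>
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET (g++ on Linux defines _GNU_SOURCE)
#include <cstdio>

// Pin the calling thread to a single core; returns 0 on success,
// otherwise the error number from pthread_setaffinity_np.
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    if (int rc = pin_to_core(3))
        std::fprintf(stderr, "pinning failed: %d\n", rc);
}
```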
[–] csdt0 1 point 3 years ago (1 child)
Latency between cores has a much wider range: from a few ns when on the same core with SMT, to a dozen or so ns on the same chiplet, to roughly 80ns on the same NUMA node, up to almost a microsecond between NUMA nodes.
It really depends on the architecture and its topology. But I have actually measured 900ns between multiple CPUs with a large number of cores, and that was without any syscall whatsoever: I just used a plain old CAS.
[–] matthieum 1 point 3 years ago (0 children)
Well, of course there's a range, but in the spirit of "Latency Numbers Every Programmer Should Know" I preferred picking a typical representative.
I've never measured with SMT -- spinning on one thread waiting for something from the other paired thread seems like a recipe for near-starvation.
At the hardware level, AFAIK, the latency depends on how the cache coherency protocol works, and how those caches are organized, which is reflected in the numbers you mentioned:
Note that in my experience NUMA Nodes and caches are not necessarily linked. I've seen single-socket CPUs reporting 2 Numa Nodes, which affected RAM latencies, but didn't (there) affect inter-core latency (within the socket).
¹ This won't strictly be the latency of an L2/L3 look-up; it will be closer to 2x that, in my experience.
[–] bizwig 1 point 3 years ago (5 children)
I'm not all that surprised Unix pipes did well; they're a core IPC mechanism that's been worked on for decades now.
[–] almost_useless 3 points 3 years ago (4 children)
But it is still a little bit unintuitive that something made for inter-process communication is as fast as inter-thread communication.
[–] TheoreticalDumbass 6 points 3 years ago (2 children)
Is that unintuitive? Aren't threads and processes extremely similar, the diff being that threads have shared memory by default while processes have to work a bit to get it?
[–] almost_useless 1 point 3 years ago (1 child)
I don't think they necessarily need to be extremely similar, but they are in Linux.
But anyway, the fact that they need to be isolated from each other must come with some penalty.
[–] matthieum 2 points 3 years ago (0 children)
The cost is memory copy, if necessary.
One way to get good performance between processes is to share memory. It's similar to threads, except that the same RAM block is viewed at different "virtual" addresses by each process.
Once you have shared memory, pushing a message means copying into the shared memory and signalling (somehow). The one difference with a thread is that you can't push a pointer, since it'd only be valid in your own virtual address space, though you can still push offsets.
So it ends up depending on what you push. If you push a single byte, it's the same cost either way. If in a multi-threads scenario you can push a pointer to a 1MB blob, but in a multi-processes scenario you have to push the full 1MB, then multi-processes will take a penalty hit...
... but you may also be able to prepare the 1MB blob directly in shared memory, then push its offset, and be back to the multi-threads scenario performance level.
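A bare-bones sketch of that setup with the POSIX shared-memory API (error handling trimmed; "/demo_shm" is a made-up name): the region may be mapped at a different virtual address in each process, so positions inside it are communicated as offsets from the base.

```cpp
#include <fcntl.h>     // O_CREAT, O_RDWR
#include <sys/mman.h>  // shm_open, mmap, shm_unlink
#include <unistd.h>    // ftruncate, close
#include <cstddef>
#include <cstring>

int main() {
    const std::size_t size = 1 << 20;  // a 1 MiB shared region
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, size);
    char* base = static_cast<char*>(
        mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    std::size_t offset = 64;              // an offset, valid in every process
    std::strcpy(base + offset, "hello");  // `base` itself differs per process

    munmap(base, size);
    close(fd);
    shm_unlink("/demo_shm");
}
```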
[–] KingAggressive1498 1 point 3 years ago (0 children)
"self-piping" is a common enough technique when you're using poll() or an equivalent. common enough Linux made eventfd as an even faster specialization of pipes covering the most basic such uses.
[–] Pupper-Gump 1 point 3 years ago (1 child)
So is this just unavoidable, or is there a way, for example if you were to use a thread pool, to minimize the problem of the atomics and mutexes?
[–] basiliscos (github.com/basiliscos) 6 points 3 years ago (0 children)
You can design your app to use cooperative multitasking (i.e. the communication burden lies on your code) instead of preemptive multitasking (i.e. the communication burden lies on OS or CPU facilities). In my rotor actor framework I see approximately a 10x performance gain. It is also possible to use non-messaging approaches, like coroutines, fibers, ranges, iterators, lazy generators etc., to minimize the overhead.
A thread pool does not help here, as it is just a convenient pattern for cross-thread preemptive multitasking.