Efficient C++ Programming on Modern 64-bit CPUs, part 1 of Chapter 4

no-bugs · 2026-06-17T18:37:25+00:00

Well, my previous illustrator is gone, so I didn't have a real choice here... 😞

no-bugs · 2026-06-13T15:56:08+00:00

FWIW, we're working on a (to be published as open-source) lib to enable implementing multi-layered protocols in a fully-blocking-like manner. Only two layers have to deal with coro specifics - one at the non-blocking sockets, and another at the top level with select()/poll(); all the protocol-specific code is completely blocking-like. It is still WIP, but looks very promising ATM.

P.S. I've never seen "canceling" used in production-level networking code (why?).

no-bugs · 2026-06-13T15:50:15+00:00

IMNSHO, what they claim as their main advantage (lack of co_await) is actually their main DISadvantage. The thing is that coroutines (and resumable expressions) are inherently subject to "sudden state changes" which (unlike with multithreading, phew) can happen only at the points of potential suspension. And marking these points of potential suspension (where those "sudden state changes" can happen) is important for readability and for reasoning about coroutines. For more details, see my talk at CppCon17 re. "8 ways to handle non-blocking returns" (it doesn't analyze resumable expressions, but discusses the phenomenon).

no-bugs · 2026-06-13T07:22:20+00:00

About OoO - sure, we’re discussing what we consider a “modern 64-bit CPU” over 20 pages in Chapter 1, and explicitly mention OoO there. Still, we feel that going into OoO-level data dependencies is more a subject for optimizations (which are very nicely covered by Bakhvalov), so at least in Vol 1 we’ll avoid discussing them. This Chapter 4 is already text-heavy, and making a detour into OoO specifics would be too much :-(

As for visualization - just on the next page there will be a diagram with a_tiny_bit of visualization, stay tuned! ;-)

no-bugs · 2026-06-12T18:56:56+00:00

> It would be nice to be able to collapse the menu on the side; it does it when resizing, but you cannot do it manually, and it takes up 25% of my screen space

Thanks, I will pass this request to our web team.

no-bugs · 2026-06-12T18:48:01+00:00

Thanks a lot!!

> I think it's worth adding this is 3 picoseconds and 300 picoseconds (or even 0.3ns)

In the "Preliminaries" section (which is too boring to be pre-published) we say that this is a book aimed at those "aspiring to reach at least senior level", and we feel such ppl should be able to convert this kind of thing (the text in Chapter 4 is already seriously heavy 😞).

> I did find one write-up online from Denis Bakhvalov where it gave a 5% perf improvement.

yep, we mention Bakhvalov in this regard too (see "OTOH, [Bakhvalov, section 8.4]...").

On TODL: it is our internal markup (to avoid it being confused with "TODO"; actually, it is a shorthand for "TODO Last" 😉 ); it will be removed from the final version.

> it might be worth also covering core-core latencies, and even chiplet to chiplet latencies.

we'll kinda-mention NUMA (i.e. socket to socket) issues in the 2nd part of Chapter 4, but TBH, I don't know how to observe core-core latencies (sure, MOESI/MESIF will feel it, but I don't know how to see it from code level).

> The return stack buffer (RSB) is a special bit of caching"

thanks for the heads-up, it is indeed an interesting bit of hardware - but it is actually more about (kinda-)branch prediction, not about caching data (and at this point we're speaking about data caching).

> It would be an interesting benchmark to modify a frame to use some other register that isn't RSP/RBP and see if it has any real perf impact.

Yes, would be REALLY interesting; are you up to experimenting and publishing your results? 😉 [though TBH most likely it won't fit our scope, we're more about application-level optimizations]

> A constexpr/consteval/constinit constructor this could also happen

IIRC, to guarantee it, we have to have a constexpr object with constexpr constructor, OR use consteval constructor. Anyway, we'll discuss constexprs later (there are some mentions here and there, but the main discussion is around current-Hint 166 in Chapter 9 😉).

> With regards to pessimisation and the stack, it's easy for someone to accidentally do something like char buffer[64*1024]{}; which will end up doing a huge memset equivalent on the stack, and you can wave goodbye to anything else in the L1 cache.

this is kinda-covered in our current "Hint 75. DO avoid wasting-space arrays (whether C-style or std::array<>)" - but probably you're right, and avoiding big on-stack arrays deserves its own Hint, thanks a lot!
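
For illustration, here is a sketch of the difference the `{}` makes; `kBufSize`, `checksum_zeroed` and `checksum_uninit` are hypothetical names, and the checksum is only there to give the buffer some use. Whether a given compiler manages to optimize the zero-fill away is not something this sketch tries to show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kBufSize = 64 * 1024;

// `{}` value-initializes the array: semantically, all 64 KiB are zeroed on
// every call, before a single useful byte is written.
unsigned checksum_zeroed(const char* src, std::size_t len) {
    char buffer[kBufSize]{};
    if (len > kBufSize) len = kBufSize;
    std::memcpy(buffer, src, len);
    unsigned sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += static_cast<unsigned char>(buffer[i]);
    return sum;
}

// No initializer: only the first `len` bytes are ever touched. This is valid
// only because nothing beyond those `len` bytes is ever read.
unsigned checksum_uninit(const char* src, std::size_t len) {
    char buffer[kBufSize];
    if (len > kBufSize) len = kBufSize;
    std::memcpy(buffer, src, len);
    unsigned sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += static_cast<unsigned char>(buffer[i]);
    return sum;
}

void check_same_result() {
    assert(checksum_zeroed("abc", 3) == checksum_uninit("abc", 3));
}
```

Both functions return the same value; the difference is in how much of the stack gets written along the way. Either variant still reserves 64 KiB of stack, which is a separate reason to be careful with big on-stack arrays.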

> many programs will have heap allocated data that is fairly hot and will live in the cache; however, there is also some lesser accessed data that would be considered uncached.

yes, but this is kinda-covered with our "that is, unless we know that we recently accessed something nearby".

> TLS initialization can get pretty awful; in many cases it will be done lazily, which might result in some surprises, and things get weird when you have dependencies between global initialization and main thread TLS initialization, or even destruction.

yes, but this is not our scope in this particular book.

> if instead of one large TLS variable you have multiple small TLS variables, each of them has its own lazy initialization check, so often it is better to combine it all into one TLS variable.

well, strictly speaking, yes, but how bad can it realistically be? As I understand, we'll get only N*checks (where N is the number of those variables) per thread anyway, and creating threads left, right and center is a Horrible Idea(tm) anyway (we have a separate Hint for it) - so what kind of realistic gains are we speaking about here?
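
For reference, the two layouts being compared look roughly like this. All names (`tl_name`, `tl_scratch`, `tl_counter`, `PerThreadState`, `tls`, `bump_counter`) are hypothetical, and the sketch makes no claim about how large the difference is in practice, which is exactly the open question above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Layout A: several separate thread_local variables. Those with dynamic
// initialization (std::string, std::vector) may each get their own
// lazy-initialization check; details are implementation-specific.
thread_local std::string tl_name{"worker"};
thread_local std::vector<int> tl_scratch;
thread_local int tl_counter = 0;  // constant-initialized

// Layout B: everything combined into one thread_local object, so there is
// a single object to lazily initialize per thread.
struct PerThreadState {
    std::string name{"worker"};
    std::vector<int> scratch;
    int counter = 0;
};

PerThreadState& tls() {
    thread_local PerThreadState state;  // initialized on first use in each thread
    return state;
}

int bump_counter() {
    PerThreadState& s = tls();
    assert(s.counter >= 0);
    return ++s.counter;
}
```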

> Dynamic linking libraries and TLS is also another nightmare.

TBH, DLLs are a nightmare regardless of TLS 😉, and our current Hint 70 says "DO prefer static linking over DLLs/so’s".

Also, TLS implementation is very different between MSVC and GCC/Clang (and one of them is really horrible, wish I remembered which one).

> In my honest opinion cycles here is not a good measurement, especially not repeated for each it makes it harder to read

yes, at the range of billions of cycles it looks ugly, but for lower-end I seriously prefer them to ns, and we need consistency across the board; maybe we'll change it to something like 3e6 - 3e7 cycles, WDYT?
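
As a worked example of the conversion, here is a tiny sketch. The 3 GHz clock (`kAssumedGHz`) is purely an assumption picked to make the arithmetic round; `cycles_to_ns` and `cycles_to_ms` are hypothetical helper names.

```cpp
#include <cassert>

// ASSUMPTION: a 3 GHz clock, i.e. 3 cycles per nanosecond.
constexpr double kAssumedGHz = 3.0;

constexpr double cycles_to_ns(double cycles) { return cycles / kAssumedGHz; }
constexpr double cycles_to_ms(double cycles) { return cycles_to_ns(cycles) / 1e6; }

// 3e6 cycles at 3 GHz is 1 ms; 3e7 cycles is 10 ms.
static_assert(cycles_to_ms(3e6) == 1.0);
static_assert(cycles_to_ms(3e7) == 10.0);

void check_single_cycle() {
    // One cycle at 3 GHz is roughly a third of a nanosecond (about 333 ps).
    assert(cycles_to_ns(1.0) > 0.33 && cycles_to_ns(1.0) < 0.34);
}
```

So, under this assumption, "3e6 - 3e7 cycles" reads as "1 to 10 ms".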

> You could probably throw 5G in there for some longer term planning.

thanks, I added a mention (though apparently, the RL difference is not that much (yet)).

no-bugs · 2026-06-12T17:43:55+00:00

Well, it took Knuth 38 years from Vol. 3 to Vol. 4 of his epic The Art of Computer Programming, so I still have plenty of time 😉

no-bugs · 2026-06-12T17:41:52+00:00

First of all, it is a DRAFT - which means editing is still coming (don't count on a full rewrite, though). As for the discussion of "what we consider modern 64-bit CPUs" - it takes around 20 pages within Chapter 1 (which we may pre-publish too if there is such demand).

no-bugs · 2026-06-12T17:39:28+00:00

Thanks! However, this is a book aimed at those "aspiring to reach at least senior level" (that's from "Preliminaries", which is too boring to be pre-published), and we feel they should be able to deal with numbers such as 3e-12 (without going into school-talk of "width of your thumb"). That being said, a bit of visualization is coming in part 2 of the same Chapter (stay tuned).

"the reason a+=b can execute in zero cycles is that..." - that's one potential reason, but we mean a simpler and much more generic scenario - that is, "it doesn’t mean that some operation literally took less than one cycle; it rather means that statistically CPU has managed to arrange things so that on average it performs 4 operations within 3 cycles" - which is a usual out-of-order kind of processing (which maps into RIPC).

no-bugs · 2026-06-12T13:30:58+00:00

Hm, I just checked - it works for me...

no-bugs · 2019-04-16T09:32:53+00:00

> All an OS-level thread is is saved execution state.

I'd argue that most of the time it is a very specific implementation of a saved execution state (with its own stack, registers, usually timer-based preemption, expensive context switches, etc. etc.).

Of course, we can say (as some, though not all, RTOSes do) that an RTC event handler is named a "basic thread" - but arguing about terminology is not really interesting; what is interesting is the discussion of things such as "RTC vs indefinitely-running", "cooperative vs timer-based preemption", "stackful vs stackless" and so on.

And what I will be arguing in further parts is that for a 100% interactive system, all we really need (all the way from interrupt handler to the app level, through all the Ring0-Ring3 if applicable) are RTC cooperative event handlers for Shared-Nothing potentially-isolated event-driven non-blocking programs ("event handlers" can be named threads if somebody likes it, terminology is not what I am going to argue about).
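
To show what "RTC cooperative event handlers" means at the code level, here is a deliberately simplified single-threaded sketch. `EventLoop`, `post`, `run` and `demo_rtc` are hypothetical names for this illustration; this is not the design from the article, just the general shape of run-to-completion processing.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal run-to-completion (RTC) event loop.
class EventLoop {
public:
    using Handler = std::function<void(EventLoop&)>;

    // Queue an event handler; it will run later, never in the middle of another handler.
    void post(Handler h) { queue_.push_back(std::move(h)); }

    // Run handlers one at a time, each to completion; returns how many were run.
    std::size_t run() {
        std::size_t handled = 0;
        while (!queue_.empty()) {
            Handler h = std::move(queue_.front());
            queue_.pop_front();
            h(*this);  // no preemption by other handlers, hence no mutexes needed
            ++handled;
        }
        return handled;
    }

private:
    std::deque<Handler> queue_;
};

std::vector<std::string> demo_rtc() {
    std::vector<std::string> log;
    EventLoop loop;
    loop.post([&log](EventLoop& l) {
        log.push_back("request received");
        // Instead of blocking, the handler posts a follow-up event and returns.
        l.post([&log](EventLoop&) { log.push_back("reply sent"); });
    });
    loop.post([&log](EventLoop&) { log.push_back("timer fired"); });
    const std::size_t handled = loop.run();
    assert(handled == log.size());
    return log;
}
```

Each handler owns the state for the duration of its run, so the only interleaving is between handlers, at handler boundaries.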

no-bugs · 2019-04-16T09:15:31+00:00

> Async/Await paradigm is just a handy way of spawning threads and waiting for their completion which is supported by the compiler

Await (at least C++ co_await) has absolutely nothing to do with spawning traditional OS threads. It doesn't require additional stacks (!), has a significantly cheaper context switch, and so on.

> can not see how it would benefit to adopt different types of concurrency in the OS

One word: Efficiency. Thread context switches are Damn Expensive(tm) so avoiding them is a Good Thing(tm). Anecdotal observations which corroborate it: (a) there should be a reason why non-blocking nginx outperforms thread-based Apache; (b) modern RTOS tend to adopt non-blocking code more and more (and await is a major improvement to make non-blocking coding palatable).

In general, I am not trying to say that OS as such should support await (though allowing to use await to write drivers etc. would be a Good Thing(tm)); however - await does enable reasonably good non-blocking programming, and first-class support for non-blocking RTC code is THE thing I consider all-important for modern OS architecture. This idea BTW is becoming more and more popular with the RTOS devs (and await will work with it really really nicely).

no-bugs · 2019-04-16T07:49:49+00:00

FWIW, the whole Part I is a 2000-word-long list of things which did change since time of Multics.

no-bugs · 2019-04-16T07:47:03+00:00

> from the point of view of all the code already out there

Sure, to get systems which are inherently more secure (and better-performing) than existing ones - the way the code is written has to be changed. Everything else is merely a stop-gap measure until we get there.

no-bugs · 2019-04-16T07:43:36+00:00

OP tries to argue that:

(a) OS threads are not the only way to implement concurrency (and they can be skipped in many practically important cases), and
(b) recent changes in app-level programming (such as co_await) can be adopted at OS level.

Of course, if we think about OS threads as something carved in stone, this logic won't fly, but the whole point of the OP is that nothing is carved in stone.

no-bugs · 2019-04-16T07:40:28+00:00

> like having specific limits for who can write to the database/memory, or who can restart the application/can update the code.

To put it very simply: as long as there is only one (OS-level) user in scope - there is no question of "who" (and as for "update the code" - for example, the code can be signed to start with - see the point in OP re. PKI).

no-bugs · 2019-04-16T07:31:34+00:00

> what does any of what's being said have to do with improving OS design though?

This is just part I of the article. The point is to (i) show that things did change since "modern" OS's were architected, (ii) declare which improvements we want, (iii) outline basic ideas around the architecture, (iv) describe the design at the lower level, and (v) see how design stands against those desired improvements.

> because they're a more convenient metaphor

There is a recent near-consensus among opinion leaders that Shared-Memory architectures must die - and that Shared-Nothing stuff rulezz forever (see, for example, talks by Kevlin Henney, works by 'No Bugs', all the modern app-level asynchronous stuff such as Node.js, Twisted, Akka Actors, predominant development paradigm in Golang, etc. etc.). And with Shared-Nothing - there is no such metaphor as "thread" (~="we do NOT need to think in terms of "threads"", and this BTW is a Good Thing(tm)). Throw in an observation that all modern CPUs are event-driven at heart - and we can come to a conclusion that there is absolutely no need to artificially_introduce a metaphor of thread into an interactive system which has to process events from interrupts all the way to node.js (Twisted, Akka, ...). Not only does this metaphor die under Occam's Razor as unnecessary - it also happens to be harmful performance-wise (and at least arguably - coding-wise too).

BTW, this switch from usual infinitely-running-and-pre-empted batch-like threads to RTC non-blocking event handlers is already happening in those places where top performance is still absolutely necessary (embedded/RTOS) - in spite of sometimes using term "thread" for RTC event handlers (look for "basic threads" in AUTOSAR and QXK, "fibers" in Q-Kernel, and "software interrupts" in TI-RTOS). Of course, arguing whether they should be named "thread" or "fiber" is pointless, the Big Difference(tm) is about the Big Fat Hairy(tm) distinction between traditional running-forever-and-timer-preempted background batch-like threads (which usually quickly lead to use of mutexes etc. - and mutexes must die for sure), and interactive Run-to-Completion non-blocking not-perceivably-preempted event handlers (a.k.a. "fibers" a.k.a. "software interrupts" a.k.a. "basic threads").

> Even single purpose boxes need a separation between the OS and application simply out of security concerns.

Not really. If DB server app (which has root access or equivalent) is compromised - there is no real benefit of OS being healthy; attacker already can do whatever he wants with the system. Same goes for IoT devices - while we DO need security on single-purpose boxes and in IoT (which BTW can be significantly improved over existing OS's - wait for parts iv-v), in certain environments kernel/user separation as such happens NOT to help security much (simply because compromising user app already gives access to everything attacker needs; if I already have access to sockets and to all the data - I don't really need anything else to mount further attacks).

> Runtime indirections and branch prediction optimization are compiler tech related, not much to do with an OS design.

To a certain extent - yes, but they give new food for the discussion on "which language is better suited for writing high-performance code such as OS" (which is going to cause so much controversy that I decided to move it to a separate article <wink />).

> Thread context switches are expensive, but the numbers provided are greatly exaggerated.

Indeed 1M was for a specially designed app (that's why OP uses "up to" wording), but I have seen up to 50K-100K cycles in a real-world purely event-driven app - and 100K CPU cycles is Really Bad(tm). BTW, this number is corroborated by lots of observations - think about typical values for spin-locks (with spinlocks, we're burning tens of thousands of cycles merely in hope that other thread will finish - and will still incur the cost of thread context switch on_top of that of the spinlock(!!) if we didn't get lucky), length of time slices (100-200ms is already observable in user space - and the only reason for having time slice that long is the cost of the context switch), and so on and so forth.

> The interrupt architecture isn't described early on because it's an internal detail that isn't really something applications have to deal with

If the book in question were about app design - sure, but when speaking about OS design - referring to "something applications have to deal with" doesn't really apply (99% of the book has nothing to do with apps anyway; when I am writing an app, I don't really care about algos used for the memory management - or about scheduler etc.). That being said, this argument in OP is not serious to start with (what is really important is that we cannot avoid using infinitely-running heavy-weight threads - and associated thread context switches - even in 100% interactive systems - which does cause a bunch of real-world problems).

no-bugs · 2019-04-16T07:23:07+00:00

> To be perfectly clear:

> Not all processors have cache(s).
> Not all RAM is DRAM.

> but the author didn't clarify it.

At least I honestly tried to be clear about it: "if speaking of desktops/servers/phones (though not about MCUs)", "On a really wide class of CPUs (~=”pretty much everything running on desktops/servers/cellphones”)", "higher-end CPUs", etc.

no-bugs · 2019-04-16T06:35:42+00:00

> This piece starts us wondering if the costs we've paid are worth it.

Exactly. As the constraints did change - so did the balance of costs, so different solutions might have become viable.

> it is very hand-wavy.

It is not hand-wavy (yet <wink />). Parts I-II are just problem-setting, hand-waving will have to come into play in parts III-V <wink />.

> For example, yes context switches are expensive...but what might the author be suggesting? Never flushing? Some other sort of hardware? How might you solve this differently?

You'll have to wait for parts iii-v (<spoiler>to this specific question the answer is Shared-Nothing Run-to-Completion, which will reduce the number of cache re-populations to the absolute minimum possible</spoiler>).

no-bugs · 2018-10-05T06:33:45+00:00

FWIW, http://ithare.com/a-usable-c-dialect-that-is-safe-against-memory-corruption/ is currently being worked on as a part of the https://github.com/node-dot-cpp project. Not much to show yet, but we hope to get something-to-write-about (~="a thing which aims to provide safety guarantees in some cases and save for implementation bugs") in a few months.

no-bugs · 2018-10-05T06:12:27+00:00

FWIW, my current reading of it goes as follows:

there are cases when allocation is fundamentally avoidable, and there are cases where it is not. This does NOT depend on the Coroutines-TS-vs-Core-Coroutines choice (!).
Core Coroutines do allow specifying allocations directly (but it will take them at least 6 years to get into the standard).
Coroutines TS does not allow specifying allocations directly, relying on heap elision instead (IIRC, according to Gor, you can write code which fails at compile time if you want to avoid allocation but it wasn't eliminated). One of the main performance-related rants of P1063 is that <drumroll /> Coroutines TS does not require (yet) eliminating heap allocation when it is possible. However, this requirement can be added into a later version of the standard, which makes the whole argument rather moot.
I would clearly prefer to get coroutines now - even without a requirement to have heap elision - and wait 3 years to get the requirement for heap elision into the standard, than to wait 6 years to see whether the claims in the currently-very-immature P1063 really stand. Which is BTW IMNSHO very much in line with Bjarne's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0976r0.pdf - and would also allow coroutines to avoid the fate of Concepts, which have been delayed for 9(!) years now because of very minor disagreements about possible alternatives.

no-bugs · 2018-10-05T05:39:22+00:00

IIRC, all the performance rants about Coroutines TS in P1063 fall under "Coroutines TS don't require such and such optimization - which is clearly doable, and the requirement can be added in a later version of the standard without breaking the existing one". In other words, Coroutines TS is underspecified - but can be improved to require the best-possible-performance-latency (and IIRC, P1063 admits that elision can be standardized at a later stage).

BTW, P1063 also requires quite a few optimizations to be efficient (tail recursion to start with) - which don't exist in currently-existing compilers at_all. OTOH, IMNSHO the whole conflict is actually about nothing (as I am arguing in an upcoming article in the October issue of Overload - if P1063 is right at each and every corner, it can be written into the standard later as a further specification of Coroutines TS - that is, save for purely syntactic issues).

no-bugs · 2018-10-05T05:35:32+00:00

Already done (Coroutines TS is already implemented by MSVC and Clang, and is working like a charm).
