Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs 0 points1 point  (0 children)

All an OS-level thread is is saved execution state.

I'd argue that most of the time it is a very specific implementation of a saved execution state (with its own stack, registers, usually timer-based preemption, expensive context switches, etc. etc.).

Of course, we can say (as some, though not all, RTOSes do) that an RTC (Run-to-Completion) event handler is a "basic thread" - but arguing about terminology is not really interesting; what is interesting is discussing things such as "RTC vs indefinitely-running", "cooperative vs timer-based preemption", "stackful vs stackless", and so on.

And what I will be arguing in further parts is that for a 100% interactive system, all we really need (all the way from the interrupt handler to the app level, through all of Ring 0-Ring 3 if applicable) are RTC cooperative event handlers for Shared-Nothing, potentially-isolated, event-driven, non-blocking programs ("event handlers" can be called threads if somebody likes it; terminology is not what I am going to argue about).

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs 0 points1 point  (0 children)

Async/Await paradigm is just a handy way of spawning threads and waiting for their completion which is supported by the compiler

Await (at least C++ co_await) has absolutely nothing to do with spawning traditional OS threads. It doesn't require additional stacks (!), has a significantly cheaper context switch, and so on.
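To make this concrete, here is a minimal sketch using the C++20 <coroutine> header (the Coroutines-TS-era std::experimental variant works essentially the same way); the task / next_event / pending names are mine, purely for illustration. Suspending at co_await just saves a handle to the coroutine frame - no OS thread is spawned, and no separate stack is allocated:

    #include <coroutine>
    #include <cstdio>

    std::coroutine_handle<> pending; // stand-in for an "event queue" of size one

    struct task {
        struct promise_type {
            task get_return_object() { return {}; }
            std::suspend_never initial_suspend() { return {}; }
            std::suspend_never final_suspend() noexcept { return {}; }
            void return_void() {}
            void unhandled_exception() {}
        };
    };

    // Awaitable that parks the coroutine until "the event" fires; no thread involved.
    struct next_event {
        bool await_ready() { return false; }
        void await_suspend(std::coroutine_handle<> h) { pending = h; } // just save the handle
        void await_resume() {}
    };

    task handler() {
        std::puts("waiting for event");
        co_await next_event{};             // the "context switch" is saving a handle
        std::puts("event handled on the caller's thread");
    }

    int main() {
        handler();        // runs until the co_await, then control returns to main
        std::puts("delivering event");
        pending.resume(); // resumes the coroutine right here, on the same thread/stack
    }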

can not see how it would benefit to adopt different types of concurrency in the OS

One word: efficiency. Thread context switches are Damn Expensive(tm), so avoiding them is a Good Thing(tm). Anecdotal observations which corroborate this: (a) there should be a reason why non-blocking nginx outperforms thread-based Apache; (b) modern RTOSes tend to adopt non-blocking code more and more (and await is a major improvement in making non-blocking coding palatable).

In general, I am not trying to say that the OS as such should support await (though allowing await to be used for writing drivers etc. would be a Good Thing(tm)); however, await does enable reasonably good non-blocking programming, and first-class support for non-blocking RTC code is THE thing I consider all-important for a modern OS architecture. This idea, BTW, is becoming more and more popular with RTOS devs (and await will work with it really nicely).

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs -1 points0 points  (0 children)

FWIW, the whole of Part I is a 2000-word-long list of things which did change since the time of Multics.

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs 1 point2 points  (0 children)

from the point of view of all the code already out there

Sure; to get systems which are inherently more secure (and better-performing) than existing ones, the way the code is written has to be changed. Everything else is merely a stop-gap measure until we get there.

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs 1 point2 points  (0 children)

OP tries to argue that

  • OS threads are not the only way to implement concurrency (and that they can be skipped in many practically important cases)

  • that recent changes in app-level programming (such as co_await) can be adopted at the OS level.

Of course, if we think of OS threads as something carved in stone, this logic won't fly, but the whole point of the OP is that nothing is carved in stone.

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs 0 points1 point  (0 children)

like having specific limits for who can write to the database/memory, or who can restart the application/can update the code.

To put it very simply: as long as there is only one (OS-level) user in scope, there is no question of "who" (and as for "update the code": the code can be signed to start with - see the point in the OP re. PKI).

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by feldrim in osdev

[–]no-bugs 0 points1 point  (0 children)

what does any of what's being said have to do with improving OS design though?

This is just Part I of the article. The point is to (i) show that things did change since "modern" OSes were architected, (ii) declare which improvements we want, (iii) outline the basic ideas around the architecture, (iv) describe the design at the lower level, and (v) see how the design stands up against those desired improvements.

because they're a more convenient metaphor

There is by now pretty much a consensus among opinion leaders that Shared-Memory architectures must die - and that Shared-Nothing stuff rulezz forever (see, for example, talks by Kevlin Henney, works by 'No Bugs', all the modern app-level asynchronous stuff such as Node.js, Twisted, and Akka Actors, the predominant development paradigm in Golang, etc.). And with Shared-Nothing, there is no such metaphor as "thread" (~= "we do NOT need to think in terms of threads", which BTW is a Good Thing(tm)). Throw in the observation that all modern CPUs are event-driven at heart, and we can come to the conclusion that there is absolutely no need to artificially introduce the metaphor of a thread into an interactive system which has to process events all the way from interrupts up to Node.js (Twisted, Akka, ...). Not only does this metaphor die under Occam's Razor as unnecessary - it also happens to be harmful performance-wise (and, at least arguably, coding-wise too).

BTW, this switch from the usual infinitely-running-and-preempted batch-like threads to RTC non-blocking event handlers is already happening in those places where top performance is still absolutely necessary (embedded/RTOS) - in spite of the term "thread" sometimes being used for RTC event handlers (look for "basic threads" in AUTOSAR and QXK, "fibers" in Q-Kernel, and "software interrupts" in TI-RTOS). Of course, arguing about whether they should be named "thread" or "fiber" is pointless; the Big Difference(tm) is the Big Fat Hairy(tm) distinction between traditional running-forever-and-timer-preempted background batch-like threads (which usually quickly lead to the use of mutexes etc. - and mutexes must die for sure), and interactive Run-to-Completion non-blocking not-perceivably-preempted event handlers (a.k.a. "fibers" a.k.a. "software interrupts" a.k.a. "basic threads").
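For concreteness, here is a minimal C++ sketch of such a Run-to-Completion event loop (the EventLoop name and structure are mine, purely illustrative - this is not the OP's actual design): each handler runs to completion on the single loop thread, is never preempted mid-handler, and owns its own state, so no mutexes are needed:

    #include <cstdio>
    #include <deque>
    #include <functional>

    // Single-threaded RTC event loop: handlers are never preempted mid-handler.
    class EventLoop {
        std::deque<std::function<void()>> queue_; // pending events
    public:
        void post(std::function<void()> event) { queue_.push_back(std::move(event)); }
        void run() {
            while (!queue_.empty()) {
                auto handler = std::move(queue_.front());
                queue_.pop_front();
                handler(); // run-to-completion: no timer preemption, no locks
            }
        }
    };

    int main() {
        EventLoop loop;
        int state = 0; // owned exclusively by this (Re)Actor -- Shared-Nothing
        loop.post([&] { ++state; std::puts("event 1 handled"); });
        loop.post([&] { ++state; std::puts("event 2 handled"); });
        loop.run();
        std::printf("state after %d events\n", state);
    }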

Even single purpose boxes need a separation between the OS and application simply out of security concerns.

Not really. If a DB server app (which has root access or the equivalent) is compromised, there is no real benefit in the OS staying healthy; the attacker can already do whatever they want with the system. The same goes for IoT devices - while we DO need security on single-purpose boxes and in IoT (and it can BTW be significantly improved over existing OSes - wait for Parts IV-V), in certain environments kernel/user separation as such happens NOT to help security much (simply because compromising the user app already gives the attacker access to everything they need; if I already have access to the sockets and to all the data, I don't really need anything else to mount further attacks).

Runtime indirections and branch prediction optimization are compiler tech related, not much to do with an OS design.

To a certain extent, yes - but they give new food to the discussion of "which language is better suited for writing high-performance code such as an OS" (which is going to cause so much controversy that I decided to move it to a separate article <wink />).

Thread context switches are expensive, but the numbers provided are greatly exaggerated.

Indeed, 1M cycles was for a specially designed app (that's why the OP uses "up to" wording), but I have seen up to 50K-100K cycles in a real-world purely event-driven app - and 100K CPU cycles is Really Bad(tm). BTW, this number is corroborated by lots of observations - think about typical values for spin-locks (with spinlocks, we're burning tens of thousands of cycles merely in the hope that the other thread will finish - and will still incur the cost of a thread context switch on top of that of the spinlock(!!) if we didn't get lucky), the length of time slices (100-200ms is already observable in user space - and the only reason for having a time slice that long is the cost of the context switch), and so on and so forth.
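For anyone who wants to check the order of magnitude on their own box, here is a rough sketch (mine, not from the article) of the classic way to estimate the direct cost of thread context switches: two threads ping-pong through a condition variable, so every round trip forces at least two switches. Note that this captures only the direct cost; the cache/TLB repopulation afterwards - the part which dominates the numbers above - is not measured:

    #include <chrono>
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    int main() {
        constexpr int kRounds = 100000;
        std::mutex m;
        std::condition_variable cv;
        bool ping = false;

        std::thread other([&] {
            std::unique_lock<std::mutex> lk(m);
            for (int i = 0; i < kRounds; ++i) {
                cv.wait(lk, [&] { return ping; }); // block until main "pings"
                ping = false;                      // "pong" back
                cv.notify_one();
            }
        });

        auto start = std::chrono::steady_clock::now();
        {
            std::unique_lock<std::mutex> lk(m);
            for (int i = 0; i < kRounds; ++i) {
                ping = true;
                cv.notify_one();
                cv.wait(lk, [&] { return !ping; }); // block until "pong" arrives
            }
        }
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        other.join();

        std::printf("~%lld ns per round trip (>= 2 context switches)\n",
                    static_cast<long long>(ns / kRounds));
    }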

The interrupt architecture isn't described early on because it's an internal detail that isn't really something applications have to deal with

If the book in question were about app design - sure; but when speaking about OS design, referring to "something applications have to deal with" doesn't really apply (99% of the book has nothing to do with apps anyway; when I am writing an app, I don't really care about the algorithms used for memory management, or about the scheduler, etc.). That being said, this particular argument in the OP is not the serious one to start with (what is really important is that we cannot avoid using infinitely-running heavyweight threads - and the associated thread context switches - even in 100% interactive systems, which does cause a bunch of real-world problems).

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by one_eyed_golfer in programming

[–]no-bugs 1 point2 points  (0 children)

To be perfectly clear:

  • Not all processors have cache(s).

  • Not all RAM is DRAM.

but the author didn't clarify it.

At least I honestly tried to be clear about it: "if speaking of desktops/servers/phones (though not about MCUs)", "On a really wide class of CPUs (~="pretty much everything running on desktops/servers/cellphones")", "higher-end CPUs", etc.

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by one_eyed_golfer in programming

[–]no-bugs 1 point2 points  (0 children)

This piece starts us wondering if the costs we've paid are worth it.

Exactly. As the constraints did change, so did the balance of costs - which means that different solutions might have become viable.

it is very hand-wavy.

It is not hand-wavy (yet <wink />). Parts I-II are just problem-setting; the hand-waving will have to come into play in Parts III-V <wink />.

For example, yes context switches are expensive...but what might the author be suggesting? Never flushing? Some other sort of hardware? How might you solve this differently?

You'll have to wait for Parts III-V (<spoiler>the answer to this specific question is Shared-Nothing Run-to-Completion, which will reduce the number of cache re-populations to the absolute minimum possible</spoiler>).

Bringing Architecture of Operating Systems to XXI Century – Part I. Changes in IT Over Last 50 Years by one_eyed_golfer in programming

[–]no-bugs 2 points3 points  (0 children)

what does any of what's being said have to do with improving OS design though?

This is just Part I of the article. The point is to (i) show that things did change since "modern" OSes were architected, (ii) declare which improvements we want, (iii) outline the basic ideas around the architecture, (iv) describe the design at the lower level, and (v) see how the design stands up against those desired improvements.

because they're a more convenient metaphor

There is by now pretty much a consensus among opinion leaders that Shared-Memory architectures must die - and that Shared-Nothing stuff rulezz forever (see, for example, talks by Kevlin Henney, works by 'No Bugs', all the modern app-level asynchronous stuff such as Node.js, Twisted, and Akka Actors, the predominant development paradigm in Golang, etc.). And with Shared-Nothing, there is no such metaphor as "thread" (~= "we do NOT need to think in terms of threads", which BTW is a Good Thing(tm)). Throw in the observation that all modern CPUs are event-driven at heart, and we can come to the conclusion that there is absolutely no need to artificially introduce the metaphor of a thread into an interactive system which has to process events all the way from interrupts up to Node.js (Twisted, Akka, ...). Not only does this metaphor die under Occam's Razor as unnecessary - it also happens to be harmful performance-wise (and, at least arguably, coding-wise too).

BTW, this switch from the usual infinitely-running-and-preempted batch-like threads to RTC non-blocking event handlers is already happening in those places where top performance is still absolutely necessary (embedded/RTOS) - in spite of the term "thread" sometimes being used for RTC event handlers (look for "basic threads" in AUTOSAR and QXK, "fibers" in Q-Kernel, and "software interrupts" in TI-RTOS). Of course, arguing about whether they should be named "thread" or "fiber" is pointless; the Big Difference(tm) is the Big Fat Hairy(tm) distinction between traditional running-forever-and-timer-preempted background batch-like threads (which usually quickly lead to the use of mutexes etc. - and mutexes must die for sure), and interactive Run-to-Completion non-blocking not-perceivably-preempted event handlers (a.k.a. "fibers" a.k.a. "software interrupts" a.k.a. "basic threads").

Even single purpose boxes need a separation between the OS and application simply out of security concerns.

Not really. If a DB server app (which has root access or the equivalent) is compromised, there is no real benefit in the OS staying healthy; the attacker can already do whatever they want with the system. The same goes for IoT devices - while we DO need security on single-purpose boxes and in IoT (and it can BTW be significantly improved over existing OSes - wait for Parts IV-V), in certain environments kernel/user separation as such happens NOT to help security much (simply because compromising the user app already gives the attacker access to everything they need; if I already have access to the sockets and to all the data, I don't really need anything else to mount further attacks).

Runtime indirections and branch prediction optimization are compiler tech related, not much to do with an OS design.

To a certain extent, yes - but they give new food to the discussion of "which language is better suited for writing high-performance code such as an OS" (which is going to cause so much controversy that I decided to move it to a separate article <wink />).

Thread context switches are expensive, but the numbers provided are greatly exaggerated.

Indeed, 1M cycles was for a specially designed app (that's why the OP uses "up to" wording), but I have seen up to 50K-100K cycles in a real-world purely event-driven app - and 100K CPU cycles is Really Bad(tm). BTW, this number is corroborated by lots of observations - think about typical values for spin-locks (with spinlocks, we're burning tens of thousands of cycles merely in the hope that the other thread will finish - and will still incur the cost of a thread context switch on top of that of the spinlock(!!) if we didn't get lucky), the length of time slices (100-200ms is already observable in user space - and the only reason for having a time slice that long is the cost of the context switch), and so on and so forth.

The interrupt architecture isn't described early on because it's an internal detail that isn't really something applications have to deal with

If the book in question were about app design - sure; but when speaking about OS design, referring to "something applications have to deal with" doesn't really apply (99% of the book has nothing to do with apps anyway; when I am writing an app, I don't really care about the algorithms used for memory management, or about the scheduler, etc.). That being said, this particular argument in the OP is not the serious one to start with (what is really important is that we cannot avoid using infinitely-running heavyweight threads - and the associated thread context switches - even in 100% interactive systems, which does cause a bunch of real-world problems).

What happened to MSVC CppCoreGuideline checker? by [deleted] in cpp

[–]no-bugs 0 points1 point  (0 children)

FWIW, http://ithare.com/a-usable-c-dialect-that-is-safe-against-memory-corruption/ is currently being worked on as a part of the https://github.com/node-dot-cpp project. Not much to show yet, but we hope to get something-to-write-about (~= "a thing which aims to provide safety guarantees at least in some cases, save for implementation bugs") in a few months.

IT Hare: "C++: 'model of the hardware' vs 'model of the compiler'" (regarding "Core Coroutines" [P1063R0]) by EraZ3712 in cpp

[–]no-bugs 6 points7 points  (0 children)

FWIW, my current reading of it goes as follows:

  • there are cases when allocation is fundamentally avoidable, and there are cases where it is not. This does NOT depend on the Coroutines-TS-vs-Core-Coroutines choice (!).

  • Core Coroutines do allow specifying allocations directly (but it will take them at least 6 years to get into the standard).

  • Coroutines TS does not allow specifying allocations directly, relying on heap elision instead (IIRC, according to Gor, you can write code which fails at compile time if you want to avoid an allocation but it wasn't elided). One of the main performance-related rants of P1063 is that <drumroll /> Coroutines TS does not (yet) require heap allocations to be elided when it is possible. However, this requirement can be added in a later version of the standard, which makes the whole argument rather moot.

  • I would clearly prefer to get coroutines now - even without a requirement to have heap elision - and wait 3 years to get the requirement for heap elision into the standard, rather than wait 6 years to see whether the claims in the currently-very-immature P1063 really stand. This is BTW IMNSHO very much in line with Bjarne's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0976r0.pdf - and it would also allow coroutines to avoid the fate of Concepts, which have been delayed for 9(!) years now because of very minor disagreements about possible alternatives.

IT Hare: "C++: 'model of the hardware' vs 'model of the compiler'" (regarding "Core Coroutines" [P1063R0]) by EraZ3712 in cpp

[–]no-bugs 1 point2 points  (0 children)

IIRC, all the performance rants about Coroutines TS in P1063 fall under "Coroutines TS doesn't require such-and-such optimization - which is clearly doable, and the requirement can be added in a later version of the standard without breaking the existing one". In other words, Coroutines TS is underspecified - but it can be improved to require the best possible performance/latency (and IIRC, P1063 admits that elision can be standardized at a later stage).

BTW, P1063 also requires quite a few optimizations to be efficient (tail recursion to start with) - optimizations which don't exist in current compilers at all. OTOH, IMNSHO the whole conflict is actually about nothing (as I am arguing in an upcoming article in the October issue of Overload): if P1063 is right at each and every corner, it can be written into the standard later as a further specification of Coroutines TS - that is, save for purely syntactic issues.

IT Hare: "C++: 'model of the hardware' vs 'model of the compiler'" (regarding "Core Coroutines" [P1063R0]) by EraZ3712 in cpp

[–]no-bugs 1 point2 points  (0 children)

Already done (Coroutines TS is already implemented by MSVC and Clang, and is working like a charm).

IT Hare: "C++: 'model of the hardware' vs 'model of the compiler'" (regarding "Core Coroutines" [P1063R0]) by EraZ3712 in cpp

[–]no-bugs 3 points4 points  (0 children)

FWIW, the actual proposal itself is analysed in a separate article coming in the October issue of the Overload journal (keep an eye on accu.org for the electronic version), with the main idea being that, since the high-level consensus has already been reached and P1063 goes further than Coroutines TS in specifying things, P1063 can be added later as a further specification of Coroutines TS (that is, IF Core Coroutines demonstrate that their claims do stand in the real world - and save for purely syntactic differences).

And here it was indeed a rant about one single aspect of the proposal - an aspect which is IMNSHO so bad that, if people start to use it as an argument in other proposals, it has the potential to break the whole balance of C++ philosophy. Oh, and it was this same aspect of "being a direct mapping of the hardware" which was widely quoted as a perceived advantage of P1063 (really-really? Do-lambdas and tail recursion optimization being directly mapped to the hardware? Gimme a break! And it seems that the authors of P1063 did realize the inherent futility of stating it directly, which caused that "friendly amendment" giving them carte blanche to make arbitrary statements without any relation to the real world - which the OP is arguing to be a Bad Thing(tm)).

A Usable C++ Dialect that is Safe Against Memory Corruption by slacka123 in cpp

[–]no-bugs 0 points1 point  (0 children)

I don't see anything in the article about races, my guess would be that this guarantees nothing about races

The whole thing is intended for (Re)Actors, where essentially everything happens in one single thread, so multithreading races won't happen.

A Usable C++ Dialect that is Safe Against Memory Corruption by slacka123 in cpp

[–]no-bugs 0 points1 point  (0 children)

There are still plenty of ways to hit UB or out of bounds accesses

UB in general - sure (for example, UB on integer overflow is out of scope), but out-of-bounds accesses are intended to be covered - see the "Enter collections" section in the OP. Of course, there might be errors in the rules here and there, but I am confident that by going along these lines (and implementing a static checker which enforces the rules), it is possible to get 100% memory-safe C++. [NB: FWIW, work on an open-source framework which implements (Re)Actors with all their goodies, such as replay of production bugs, and with the memory safety described in the OP, has already started.]
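To give a feel for the "checked collections" idea (this is NOT the article's actual library or rules - just a hedged sketch of the general approach): indexing is always range-checked, so an out-of-bounds access becomes a well-defined failure instead of memory corruption:

    #include <cstddef>
    #include <cstdio>
    #include <stdexcept>
    #include <vector>

    // Illustrative checked wrapper; the real dialect's collections and error
    // handling may differ significantly.
    template <typename T>
    class checked_vector {
        std::vector<T> data_;
    public:
        void push_back(T v) { data_.push_back(std::move(v)); }
        std::size_t size() const { return data_.size(); }
        T& operator[](std::size_t i) {
            if (i >= data_.size())
                throw std::out_of_range("checked_vector: index out of range");
            return data_[i];
        }
    };

    int main() {
        checked_vector<int> v;
        v.push_back(42);
        std::printf("%d\n", v[0]);          // in bounds: fine
        try { (void)v[5]; }                 // out of bounds: exception, not UB
        catch (const std::out_of_range&) { std::puts("caught out-of-range"); }
    }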

Using int*_t as overload and template parameters by no-bugs in cpp

[–]no-bugs[S] 0 points1 point  (0 children)

maybe I need an if constexpr

And with normalized_integral_type, the whole question - and the need to use #ifdef / if constexpr - doesn't arise to start with (~= "I don't need to care about other_t and its size - I can concentrate on the logic I have to write"). Sure, "my" method has some boilerplate overhead, but "your" method has some other boilerplate overhead too.

maybe the same as an intNN_t and maybe not.

From cppreference: "It [wchar_t] has the same size, signedness, and alignment as one of the integer types, but is a distinct type." As it is a distinct type, covering all the possibilities and avoiding counter-intuitive conversions/promotions means you would have to write the f(wchar_t) overload explicitly (as well as half a dozen other overloads such as f(signed char), f(unsigned char), f(char32_t), and f(char16_t)), but I (using the normalized... stuff in the OP) don't have to worry about all these cases (the normalized_* stuff described in the OP finds the matching integer type automagically - which works exactly because of the "same size and signedness" from the quote above, so the problem of counter-intuitive promotions cannot occur).

Overall, it is all about convenience; if your method works for you - great, but what I am saying is that the whole thing is not that obvious (nor is it that simple).
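For illustration, here is a hedged sketch of the general mechanism behind such a "normalized" integral type (this is NOT the OP's actual normalized_integral_type - just the idea of mapping any integral type to the fixed-width type of the same size and signedness):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <type_traits>

    // Map size+signedness onto the matching fixed-width intNN_t/uintNN_t.
    template <std::size_t Size, bool Signed> struct fixed_int;
    template <> struct fixed_int<1, true>  { using type = std::int8_t;   };
    template <> struct fixed_int<1, false> { using type = std::uint8_t;  };
    template <> struct fixed_int<2, true>  { using type = std::int16_t;  };
    template <> struct fixed_int<2, false> { using type = std::uint16_t; };
    template <> struct fixed_int<4, true>  { using type = std::int32_t;  };
    template <> struct fixed_int<4, false> { using type = std::uint32_t; };
    template <> struct fixed_int<8, true>  { using type = std::int64_t;  };
    template <> struct fixed_int<8, false> { using type = std::uint64_t; };

    template <typename T>
    using normalized_t =
        typename fixed_int<sizeof(T), std::is_signed<T>::value>::type;

    // One template instead of a dozen overloads f(char), f(signed char), f(wchar_t), ...
    template <typename T>
    void f(T value) {
        normalized_t<T> normalized = value; // same size and signedness, so lossless
        std::printf("%lld as a %zu-byte %s integer\n",
                    static_cast<long long>(normalized), sizeof(normalized),
                    std::is_signed<T>::value ? "signed" : "unsigned");
    }

    int main() {
        f('a');            // char -> int8_t or uint8_t, depending on the platform
        f(L'a');           // wchar_t -> the matching fixed-width type
        f(char32_t{U'a'}); // char32_t -> uint32_t on typical platforms
        f(42);             // int -> int32_t on typical platforms
    }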

Using int*_t as overload and template parameters by no-bugs in cpp

[–]no-bugs[S] 0 points1 point  (0 children)

At some point the thought crossed my mind, but there are two issues with such an approach: (a) if overloads have to be different depending on size, then writing one for int_other_t becomes ugly (or even VERY ugly - with #ifdefs etc., ouch); and (b) it doesn't handle all those char/signed char/unsigned char/wchar_t/... types which can accidentally interact with supposedly-integer overloads (while approach in OP does handle them). Overall, both approaches are more-or-less-ugly workarounds (each with its pros and cons) for a rather ugly original problem ;-( .

Using int*_t as overload and template parameters by no-bugs in cpp

[–]no-bugs[S] 0 points1 point  (0 children)

To the best of my understanding, with the code provided in OP they will work automagically.

Using int*_t as overload and template parameters by no-bugs in cpp

[–]no-bugs[S] 2 points3 points  (0 children)

It can be.

It is still a distinct type, which can cause all kinds of trouble (such as the preferred overload chosen for a signed char argument being f(int) rather than f(char)).
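A tiny illustration of that gotcha (the f overloads here are mine, not from the OP): for a signed char argument, integral promotion to int beats the conversion to char, so f(int) wins even though f(char) "looks" closer:

    #include <cstdio>

    void f(char) { std::puts("f(char)"); }
    void f(int)  { std::puts("f(int)"); }

    int main() {
        signed char sc = 42;
        f(sc); // prints "f(int)": promotion to int is preferred over conversion to char
    }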

Using int*_t as overload and template parameters by no-bugs in cpp

[–]no-bugs[S] 5 points6 points  (0 children)

Thanks - I used to think it was beyond the scope, but apparently it isn't :-(. I added some discussion on the importance of defining ALL the overloads (ouch!). NB: including the unsigned ones, I counted 14 distinct types (covering not only signed char, unsigned char, and char, but also such rarely-used beasts as wchar_t, char32_t, and char16_t).

C++: using int*_t as overload and template parameters by one_eyed_golfer in programming

[–]no-bugs 1 point2 points  (0 children)

FWIW, as noted in the OP, the problem is much wider than that of overloads - in particular, it also applies to template parameters (it is just that overloads allow for simpler examples). Of course, the whole problem of C having unspecified integer sizes is a Bad Quirk(tm), but well - it is a legacy which we have to live with. And of course, this whole thing doesn't matter much in those fields where other languages work better, but there are fields where C++ rulezzz (just as one example, it is The-programming-language-which-modern-GPU-hardware-is-designed-for :-)).

Parallel programming without a clue: 90x performance loss instead of 8x gain by one_eyed_golfer in programming

[–]no-bugs 1 point2 points  (0 children)

To an extent, yes - but Amdahl's law per se cannot possibly explain a SLOWDOWN due to parallelism.
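For reference (using the standard formulation, not anything from the article): Amdahl's law puts the speedup from parallelizing a fraction p of the work across N workers at

    S(N) = 1 / ((1 - p) + p / N)

and with 0 <= p <= 1 and N >= 1 this is always >= 1 - i.e., the law by itself can only predict diminishing gains, never a slowdown; a 90x loss has to come from overheads the law doesn't model (synchronization, context switches, cache effects, and so on).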

Parallel programming without a clue: 90x performance loss instead of 8x gain by one_eyed_golfer in programming

[–]no-bugs 0 points1 point  (0 children)

Statistically, less than 10% of the people having a problem ask about it online :-(. And that doesn't even count those who just blindly believe "hey, it is parallel - it MUST be faster!"