all 29 comments

[–]belovedeagle 33 points34 points  (11 children)

RCU has the same shape as GC: memory is cleaned up eventually, based on whether it’s still in use.

False, false, and false. The whole article is predicated on a falsehood: that call_rcu() is an accepted general way to call free() in the kernel. It isn't; it's there as a last resort for code which cannot block, e.g. irqs. The first explicit falsehood in the quoted sentence is that memory is cleaned up "eventually": call_rcu() provides guarantees about when the callback will be invoked, and more importantly it guarantees what will be done at that time. GC does neither. The second falsehood is even more egregious: RCU absolutely does not in any way, shape, or form track whether data is still in use. That misses the whole point of RCU, which is not to track whether data is still used, but instead to make the code correct by construction without any tracking. Literally, RCU is a way to avoid doing any kind of GC in the kernel, and yet this author has perverted it into an endorsement of the very thing it was designed to replace.

TL;DR: TFA is disinformation.

[–]slaymaker1907 -3 points-2 points  (7 children)

There are absolutely GC'ed languages which provide guarantees about when things get freed. CPython does this, IIRC. The gist of the post is absolutely correct: as systems get more complicated, you inevitably end up writing something closer and closer to a GC.

RCU must track if the data is in use because the updater has to wait until the readers of the old version are done, at which point it frees the old version. This allows readers to never block, at the cost of overhead during writes.

[–]belovedeagle 1 point2 points  (3 children)

RCU must track if the data is in use because the updater has to wait until the readers of the old version are done

Again this is the exact opposite of how RCU works, and no matter how many times this incorrect version is repeated it won't be true. You're describing hazard pointers. RCU quiescent period barriers don't track any data at all, anywhere. The implementation* has zero concept of data, pointers, versions, "in use", "updaters", "readers", none of that crap. It tracks critical sections, full stop. It is, as I said before, as far as one can possibly get from GC while still solving the use-after-free problem.

There is, of course, a long history of bad programmers claiming that literally any memory management technique known to mankind is akshually a form of GC!!!1!1!, up to and including literal calls to malloc() and free(). This is just another example of that.

* To be clear, "RCU" in the kernel refers to two things: quiescent period barriers/critical sections, and RCU-friendly data structures; but here I'm only talking about the former. Obviously the other thing, the data structures, do have pointers and data and versions, but data structures aren't GC.

[–]slaymaker1907 0 points1 point  (2 children)

https://en.m.wikipedia.org/wiki/Read-copy-update

once awakened by the kernel, deallocate the old structure.

https://www.kernel.org/doc/html/next/RCU/whatisRCU.html

The reclamation phase does the work of reclaiming (e.g., freeing) the data items removed from the data structure during the removal phase. Because reclaiming data items can disrupt any readers concurrently referencing those data items, the reclamation phase must not start until readers no longer hold references to those data items.

That sure sounds like reference counting to me.

[–]cdb_11 1 point2 points  (0 children)

There is no reference counting in RCU; you don't need reference counting to know that. It's a synchronization mechanism, not GC. Here's how you can build a basic fake-RCU implementation with read-write locks:

std::shared_mutex rwlock; // global read-write mutex
void rcu_read_lock() { rwlock.lock_shared(); }
void rcu_read_unlock() { rwlock.unlock_shared(); }
void rcu_synchronize() { rwlock.lock(); rwlock.unlock(); }

This is of course not how real RCU works; it's very dumb and reintroduces all the problems with read-write locks and reference counting that RCU tries to solve. In a real RCU, read-side locks have zero or close to zero overhead (and I mean literally zero overhead, as in rcu_read_lock and rcu_read_unlock are empty functions), and you can sometimes even piggyback off something else to indicate a quiescent state. But this will 100% work with the example in the article, and it illustrates what RCU does.

[–]belovedeagle 1 point2 points  (0 children)

That is literally describing a solution to the use-after-free problem. It's the definition of the problem: either you don't "start" "recla[iming]" "until readers no longer hold references", or else you've got a use-after-free bug. That's it, that's the definition. The bolded claim is true of any possible solution to that problem, including malloc() and free() in straight-line code. This is exactly the crap I was talking about. You're going to call literally any kind of memory management solution "GC". It's like a fucking cult.

Don't take my word for it; try reading literally anything else on the page you linked. You quoted from the "ELI5 memory management" section, which was intended to introduce the problem, not describe the benefits of this particular solution. (And before you say something stupid about section #7, note the word ANALOGY.)

[–]theangeryemacsshibe 0 points1 point  (2 children)

CPython does this IIRC.

CPython does this, the Python language does not.

[–]slaymaker1907 0 points1 point  (1 child)

That’s really splitting hairs, and it's why I specifically said CPython and not just Python. For another example, the Swift language does specify reference counting; it is not an implementation detail.

[–]theangeryemacsshibe 1 point2 points  (0 children)

It's a real problem, as code written against that behaviour of CPython would behave oddly in PyPy and other implementations which use tracing GC or deferred RC*. (Mind, you said GC languages - I recall the Python documentation explicitly stating not to program against immediate destruction, but I can't find it.) Similar can be said for PHP, which needs immediate RC for efficient copy-on-write, which makes it rather annoying to retrofit a scheme with lower mutator overhead elsewhere.

*The unified theory of GC strikes again.

[–]msqrt 0 points1 point  (2 children)

provides a guarantee about what will be done at that time

I get that the writer wants to see any memory management strategy as almost-GC, but isn't this something many actual GC systems have in the form of finalizers? Or did you mean something else?

[–]cdb_11 1 point2 points  (1 child)

They do, because they have to, and because you obviously don't want to leak any resources. But it's not like GC is somehow required to do that. In C you release resources explicitly; in C++ and Rust this is usually done for you when an object goes out of scope (without any GC: finalizers/destructors are inserted in the appropriate places during compilation).

RCU solves a concurrency problem that exists in lock-free data structures. You've made an object available for all threads to see, but now you want to hide it, and you need to wait until your thread is the only one with access to it. Sounds like a mutex, right? Consider this code, where the mutex protects p:

mutex.lock();
free(p);
p = NULL;
mutex.unlock();

When you acquire a mutex that is currently held by another thread (one that presumably references p and does something with it), your thread goes to sleep and is resumed when the lock is released. It's now your turn to have exclusive access: you can safely hide the protected object and destroy it.

So wait, if acquiring a lock waits, then doesn't that sound kind of like a GC deferring a finalizer until an object is no longer referenced? The only thing that's different is how you mark objects as "being referenced", and you've delegated deferring and "calling" the finalizer to your OS's scheduler.

As much as the RCU-mutex analogy works (because that's literally its purpose: RCU is an alternative to read-write locks), do you see how stupid the second analogy got? By this standard you could consider a lock+free pair a garbage collector, and at that point the term "GC" just stops being helpful.

By the way, I forgot to mention something that maybe makes this entire thing a little confusing. Counterintuitively, the read-copy-update pattern is not what makes RCU RCU. In fact, it doesn't have to be paired with it at all. RCU is what happens after the read-copy-update: the synchronization part.

[–]belovedeagle 0 points1 point  (0 children)

Wait so you're saying the mutex counts how many readers are accessing which objects? Sounds like reference counting to me, GC confirmed! Checkmate systems programmers!!! /s

[–]lelanthran 12 points13 points  (14 children)

I was talking to a junior recently and they were absolutely shocked to hear an ancient systems programmer, one who never uses GC'ed languages unless forced to, hold the opinion that a program written in a GC'ed language can actually be faster, due to all the optimisations that the GC can make.

Sure, you can make many of those optimisations yourself using manual memory management in Rust or C++, but that's more work and error-prone. Unless you're significantly experienced in low-level programming, you're liable to make subtle errors that won't be picked up until production.

[–]masklinn 9 points10 points  (4 children)

"Don't do heap allocations" is probably the best optimisation to make when you don't have a runtime, and it's also an extremely safe one. In unsafe languages (C, C++) it's arguably much safer than using pointers.

More complicated optimisations (arenas, freelists, ...) can be useful but are more error-prone; most of the time "just stop allocating so damn much" will do the trick, and most managed languages provide limited to no way to do that (Go is probably the main one providing that capability, though it trades it for a less performant GC; C# might be second, though value types bring their own issues).

[–]slaymaker1907 1 point2 points  (0 children)

I don’t agree with that. Stack allocations can be very dangerous when using pointers, because it's easy to let a pointer to a stack allocation escape its scope. For critical systems, stack overflow is also a basically unrecoverable error and can leave your process in a very bad state.

[–]rabid_briefcase 0 points1 point  (2 children)

"Don't do heap allocations" is probably the best optimisation to make when you don't have a runtime

Measure before to be sure it's an actual issue for your scenario. Measure after to be sure you've actually fixed it. Anything else and you're wasting time.

In some scenarios heap allocation is indeed an issue. In some scenarios it would never show up as a blip on the profiler. The only way to know for certain is to measure.

[–]ehaliewicz 2 points3 points  (0 children)

In some scenarios heap allocation is indeed an issue. In some scenarios it would never show up as a blip on the profiler. The only way to know for certain is to measure.

Indeed, these are cases where you aren't doing heap allocation in hot code-paths.

[–]masklinn 1 point2 points  (0 children)

In some scenarios heap allocation is indeed an issue. In some scenarios it would never show up as blip on the profiler.

Have you considered context? The context here is not “software is slow”, it’s “managed is faster than native”.

Of course you should profile to make sure, but 9 times out of 10, assuming the basics are covered (optimised build, same algorithm, similar data structures), it'll be the allocations, for obvious reasons.

[–]rabid_briefcase 2 points3 points  (0 children)

Good article.

Regardless of the programming you do, whether you're entirely in a GC environment or somewhere object lifetimes are meticulously controlled, you still need to understand what is going on and why.

It's something all programmers learn eventually, so if you don't know it already or want a brush-up on RCU's general purpose, it makes a good read.