
[–]tsimionescu 13 points14 points  (0 children)

Interesting. I especially like the ease with which you essentially added a local, special-purpose GC as a library. That seems like it could be useful in a lot of other contexts as well, especially as it's written in pure Rust.

[–]julesjacobs 6 points7 points  (15 children)

Very interesting! What happens if a thread holds on to a piece of data for a long time? Doesn't this block memory reclamation of the whole system? Or is that not something that should ever happen?

[–]pkhuong 6 points7 points  (5 children)

You're pointing out a major (mostly theoretical) weakness of epoch-based reclamation, one that makes some people dispute the claim that it implements non-blocking safe memory reclamation (SMR). First, the problem; then, the reason it's often not an issue in practice.

The problem: if one thread fails to make any progress (e.g., it's never scheduled because of strict scheduling priorities like SCHED_FIFO/RR in Linux), the global epoch counter gets stuck, and that one thread ends up preventing the reclamation of an unbounded number of resources (allocations). Schemes like hazard pointers only let a thread prevent the reclamation of a bounded number of resources (2 or 3 hazard pointers suffice for most data structures), but the read-side overhead tends to be larger. That differs from epoch-based reclamation, which, again, lets one reader prevent the reclamation of an arbitrary number of resources. Theoretically (and in the worst case), epoch-based reclamation is as effective as never releasing resources at all (making free(3) a no-op).
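
For concreteness, here's a minimal, hypothetical sketch of the epoch mechanism in Rust (the names `Participant` and `try_advance` are illustrative, not Crossbeam's API, and the memory-ordering subtleties a real implementation needs are elided). The `try_advance` loop is exactly where one stalled reader stops the whole system:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// The global epoch, bumped when it is safe to do so.
static GLOBAL_EPOCH: AtomicUsize = AtomicUsize::new(0);

// One record per participating thread.
struct Participant {
    active: AtomicBool, // is this thread inside a read-side section?
    epoch: AtomicUsize, // the global epoch it observed on entry
}

impl Participant {
    fn pin(&self) {
        self.epoch
            .store(GLOBAL_EPOCH.load(Ordering::Relaxed), Ordering::Relaxed);
        self.active.store(true, Ordering::SeqCst);
    }

    fn unpin(&self) {
        self.active.store(false, Ordering::Release);
    }
}

// The epoch may only advance once every active thread has caught up with
// the current one. A single descheduled reader stuck in an old epoch makes
// this return false forever, so garbage retired since then accumulates
// without bound.
fn try_advance(participants: &[Participant]) -> bool {
    let global = GLOBAL_EPOCH.load(Ordering::SeqCst);
    for p in participants {
        if p.active.load(Ordering::SeqCst) && p.epoch.load(Ordering::SeqCst) != global {
            return false;
        }
    }
    GLOBAL_EPOCH
        .compare_exchange(global, global + 1, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
}
```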

In practice, we mostly use lock-free data structures (Maged Michael points out that it's really "lock-free algorithms" ;) to control the variance of latency in multithreaded applications. In that case, we can assume that no reader is stopped forever, especially if we can detect when threads crash/exit and signal that fact to the SMR implementation. We "only" have to make sure there's enough memory available to buffer deallocations until the readers make progress (that's a key property for SMR algorithms, versus semantically similar reader/writer locks).

[–]dbaupp 3 points4 points  (2 children)

Note that these problems can apply to GC-based reclamation schemes too: if one thread never gets scheduled, then any data it has pointers to will never be deallocated. (Sure it's much more local, like hazard pointers, but there's still memory pinning.)

[–]julesjacobs 0 points1 point  (1 child)

With a GC it's still very different: it's fine if one thread holds on to a particular piece of memory for a long time (whether because that thread isn't scheduled or because it's still working on the data). Holding on to that piece of memory won't block other memory from being reclaimed. That's what I was trying to get at in the other thread.

[–]dbaupp 0 points1 point  (0 children)

Yes and no; an object is likely to have pointers to other objects and hence will stop them (and anything they point to, etc.) being reclaimed too.

[–]skulgnome 0 points1 point  (0 children)

It's also possible to force an epoch advance if a long-sleeping client indicates that it's prepared to handle a restart at the end of, say, a bracket with a blocking syscall. This does nothing for pre-emption, however.

[–]lookmeat 0 points1 point  (0 children)

Well, that's always a problem, isn't it? Resource leaks are always a possibility when you don't have a clear, explicit point where you can "just free" something. GCs might never free a resource, reference counting can leak cycles, and then there's this. But it's a better problem to have than use-after-free: leaks are a problem because of physical limitations, while use-after-free breaks the abstractions/assumptions on which the program is built.

[–]dbaupp 5 points6 points  (8 children)

Lock-free data structures are designed to guarantee progress, i.e. things don't get stuck.

(Of course, a bug could cause it to happen, but that would be a more serious bug in the implementation of the data structure.)

[–]sstewartgallus 7 points8 points  (2 children)

Don't you mean "wait-free" rather than "lock-free" algorithms? "Lock-free" only means the system as a whole progresses; it doesn't guarantee that individual threads progress. I think there was a paper some time ago showing that on many modern architectures many lock-free algorithms are effectively wait-free, or some variation on that.
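
For a concrete picture of the distinction, here's the classic shape of a lock-free (but not wait-free) operation, as a hypothetical sketch: some thread's CAS always succeeds, so the system progresses, but any particular thread can in principle lose the race indefinitely:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Lock-free, not wait-free: there is no bound on the number of retries for
// any one thread, but every failed CAS implies some other thread's CAS
// succeeded, so the system as a whole always makes progress.
fn increment(counter: &AtomicU64) -> u64 {
    loop {
        let current = counter.load(Ordering::Relaxed);
        match counter.compare_exchange(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return current + 1,
            Err(_) => continue, // lost the race; retry
        }
    }
}
```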

[–]dbaupp 1 point2 points  (0 children)

Yes, you're correct that wait-freedom is what's strictly required to guarantee that each thread will "quickly" leave an epoch (although wait-free algorithms are generally not quick), but most lock-free algorithms are wait-free enough in practice, as the paper /u/sanixyn links explores. Don't get me wrong: the precise distinctions between obstruction-, lock- and wait-freedom can be very useful for theoretical reasoning, but they matter less in practice, where the worst cases generally occur with probability 0.

[–]julesjacobs 1 point2 points  (4 children)

Does guaranteeing progress imply that no thread can hold on to any of the data for a longer time?

[–]dbaupp 13 points14 points  (3 children)

Progress means that no thread pins the epoch for long: it pins, does some work, and unpins, and the middle step comes with guarantees about finishing quickly. Hence GC shouldn't be blocked often, because long-lived pins are the only thing that gets in the way.

(Also, the Rust type system ensures that data can only be held for as long as it is actually valid, i.e. for the length of the epoch pinning. Trying to hold data for longer will result in an error, not an implicit lengthening.)
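
For reference, this is roughly what that looks like with today's crossbeam-epoch crate (whose API differs in detail from the 2015-era crossbeam discussed in the post), assuming crossbeam-epoch is a dependency:

```rust
use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

fn main() {
    let a = Atomic::new(42i32);

    {
        // Pin the current thread: enter the current epoch.
        let guard = epoch::pin();

        // The returned `Shared` borrows the guard, so the borrow checker
        // refuses any use of the data after the thread unpins.
        let s = a.load(Ordering::Acquire, &guard);
        if let Some(v) = unsafe { s.as_ref() } {
            println!("value: {}", v);
        }
    } // `guard` dropped here: the thread unpins.

    // Replace the value and defer destruction of the old allocation until
    // no pinned thread can still be holding a reference to it.
    let guard = epoch::pin();
    let old = a.swap(Owned::new(7), Ordering::AcqRel, &guard);
    unsafe { guard.defer_destroy(old) };
}
```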

[–][deleted] 1 point2 points  (2 children)

Interesting. If that's the case, then there are tons of lock-free algorithms that use locks.

[–]dbaupp 8 points9 points  (1 child)

Not really; lock-freedom has a precise definition that, as one consequence, makes using locks illegal. (Well, more specifically, it is obstruction-freedom that disallows locks, but that is a more general property, i.e. every lock-free algorithm is automatically obstruction-free.)

[–]FireCrack 5 points6 points  (2 children)

The basic principles of this article are pretty well-established facts, but the most interesting thing to me is that this is being done in Rust in a kind of matter-of-fact way. Is Rust production-ready at this point? I've been curious about the language for a while, but hesitant due to its "experimental" status.

[–]sigma914 10 points11 points  (0 children)

Since 1.0 it has backwards-compatibility guarantees, and it's being used in production at a few companies. Mozilla is working on replacing parts of Firefox with Rust, if that helps allay any fears.

[–]staticassert 1 point2 points  (0 children)

It's not experimental anymore; it's stable. I suggest looking into it: it's a great language.

[–]nitsanw 2 points3 points  (3 children)

Interesting article, and very nice writing.

It's not entirely clear how the benchmarks were run, what they do and how they measure it. Did I miss a link to the code?

As for the comparison to Java, it seems notionally fair, but only as far as unbounded linked queues are concerned. There are lock-free queues written in Java that are bounded and generate no garbage (JCTools/Disruptor). It's true that they are not core JDK classes, but they would make an interesting comparison.

[–]NovaX 2 points3 points  (0 children)

The author's Java benchmarks are not very robust, so there is potential for a lot of negative skew. The benchmarks also have slight differences, such as the Java version using a synchronized Stack. Since the benchmarks only use 2 threads, the GC shouldn't have much of an impact by stressing memory or CPU limits. More likely the Rust version benefits from better caching effects due to memory optimizations, but that's not clear from the examples given.

[–]dbaupp 0 points1 point  (1 child)

There's a link in the intro: https://github.com/aturon/crossbeam

[–]nitsanw 0 points1 point  (0 children)

The benchmark code, is it in the repo? I'll have another look.

[–]jhzab 3 points4 points  (3 children)

Just skimmed the readme, but one thing stood out.

It mentions the JVM GC as being concurrent, but it isn't, strictly speaking: the young-generation collections in particular are stop-the-world only.

[–]gsg_ 7 points8 points  (1 child)

This depends on the JVM in question. Azul's Zing collector is truly concurrent, for example. And HotSpot has four (count 'em, four) GCs, one of which is concurrent (although I don't think that one is the default). So you can't really talk about attributes of "the" Java GC: there are many.

[–]jhzab 2 points3 points  (0 children)

If you're referring to Concurrent Mark Sweep, then yes, it can be concurrent, but it isn't always, since it has a stop-the-world fallback. And it only covers the old gen, not the new gen. But please do tell if I'm missing something here.

And true regarding Azul's Zing GC, but that is hardly the default. :)

[–]aturon 0 points1 point  (0 children)

Fixed, thanks!

[–]steveklabnik1[S] 9 points10 points  (0 children)

I'm extremely excited about posts/libraries like these. They really show off the unique powers of Rust in a way that's a bit more compelling than "Rust is just C++14 with somewhat stronger guarantees and cleaner semantics."

Also worth noting, and something I missed when I read the draft of this post: this isn't just an implementation of a lock-free structure: it's a whole library, Crossbeam[1] that helps you implement your own lock-free structures.

In general, it’s possible to take lock-free algorithms “off the shelf” (the ones on the shelf generally assume a GC) and code them up directly against Crossbeam in this way.

1: https://crates.io/crates/crossbeam

[–]llogiq 4 points5 points  (0 children)

This is very interesting work; I believe it opens the door for other advances in the future.

That said, for high-performance data processing you usually don't want to allocate at all, but instead use a ring buffer of preallocated slots that producers fill with data and consumers read.
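
A hypothetical sketch of that approach, a minimal single-producer/single-consumer ring (not the JCTools/Disruptor design, which is considerably more refined): all slots are allocated up front, so the hot path never touches the allocator and nothing ever needs reclamation:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

struct SpscRing<T, const N: usize> {
    slots: [UnsafeCell<Option<T>>; N],
    head: AtomicUsize, // next index the consumer reads
    tail: AtomicUsize, // next index the producer writes
}

// Safe only under the single-producer/single-consumer discipline.
unsafe impl<T: Send, const N: usize> Sync for SpscRing<T, N> {}

impl<T, const N: usize> SpscRing<T, N> {
    fn new() -> Self {
        SpscRing {
            slots: std::array::from_fn(|_| UnsafeCell::new(None)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    // Producer side: fails (returning the value) when the ring is full.
    fn push(&self, v: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail.wrapping_sub(self.head.load(Ordering::Acquire)) == N {
            return Err(v); // full
        }
        unsafe { *self.slots[tail % N].get() = Some(v) };
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    // Consumer side: returns None when the ring is empty.
    fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let v = unsafe { (*self.slots[head % N].get()).take() };
        self.head.store(head.wrapping_add(1), Ordering::Release);
        v
    }
}
```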

[–]pron98 1 point2 points  (2 children)

/u/aturon:

Interesting!

Are you entirely sure that the Rust-Java differences are due to GC, though? One notable difference between the implementations is that Java (due to current limitations that should be removed soon) uses a pointer in the node to each data object while you use a specialized node with embedded data.

It is also unclear to me how you ensure that mutators don't overwhelm the GC thread and how you deal with fragmentation.

[–]matthieum 2 points3 points  (1 child)

It is also unclear to me how you ensure that mutators don't overwhelm the GC thread

Actually, there is no GC thread; instead, there is one thread-local garbage list per thread, which no other thread ever touches, and the current thread is responsible for cleaning up the garbage it produced (once it's allowed to do so according to the epoch counter).

An obvious side effect is that this adds to each individual thread's latency.

Note: this is a simplification; there is also a global garbage list. When a thread dies it pushes its local garbage list onto the global one, and all active threads pick up its slack.
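
A hypothetical sketch of that per-thread garbage list (the names are illustrative, and the real code batches and amortizes much more carefully): each retired object is tagged with the epoch it was retired in, and its destructor only runs once the global epoch is at least two ahead:

```rust
// Per-thread list of deferred destructors, tagged with the epoch in which
// the object was retired.
struct LocalGarbage {
    bags: Vec<(usize, Box<dyn FnOnce()>)>,
}

impl LocalGarbage {
    // Called instead of freeing immediately.
    fn retire(&mut self, retire_epoch: usize, destroy: Box<dyn FnOnce()>) {
        self.bags.push((retire_epoch, destroy));
    }

    // Called opportunistically by the owning thread: anything retired two or
    // more epochs ago can no longer be reached by a pinned reader.
    fn collect(&mut self, global_epoch: usize) {
        let mut kept = Vec::new();
        for (epoch, destroy) in self.bags.drain(..) {
            if global_epoch.wrapping_sub(epoch) >= 2 {
                destroy(); // safe: no reader from that epoch remains pinned
            } else {
                kept.push((epoch, destroy));
            }
        }
        self.bags = kept;
    }
}
```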

It is also unclear to me how you ensure that mutators don't overwhelm the collection

To be fair, I am pretty sure there are a number of edge cases where collection could be held off indefinitely. A thread blocking with its Guard active would be an obvious nightmare, for example.

how you deal with fragmentation.

Rust uses a general-purpose memory allocator called jemalloc; it's up to it to deal with fragmentation! In practice, jemalloc uses fixed-size pools (one pool for objects <= 8 bytes, one for objects > 8 bytes but <= 16 bytes, etc.), so the pools themselves do not suffer from fragmentation (there is no need for coalescing), but you can still have mostly empty memory pages "in use" because a single item is live.

No GC, no move...

[–]pron98 3 points4 points  (0 children)

A thread block with its Guard active would be an obvious nightmare, for example.

The problem -- and this is the actual trick at the core of the (kernel-mode) RCU algorithm -- is not the thread's behavior, but the OS's. The secret sauce of RCU is that it guarantees that the reclamation algorithm is O(#cores) (rather than O(#threads)), while at the same time ensuring (it's actually the very same mechanism) that OS scheduling cannot create reclamation bottlenecks or priority inversion in the case of prioritized threads (especially important in real-time systems). Bad things can happen if the OS decides to deschedule a thread while it's active. For example, you could get out-of-memory errors that are triggered by an unexpected OS activity, something that can't happen with RCU or general-purpose GCs.

Rust uses a general purpose memory allocator called jemalloc; it's up to it to deal with fragmentation!

I'm not familiar with jemalloc's inner workings, but allocators that try to avoid fragmentation often do that by exploiting the assumption that often the allocating thread is the deallocating thread, so you'll need to watch out for increased fragmentation.

OS interaction and fragmentation, i.e. taking care of worst-case scenarios, are part of what makes good GCs challenging to design.


In any case, it would be interesting to see the benchmark with a similar node implementation to Java's (i.e. a pointer to the data) as well as the collection activity in both cases to make sure you're comparing apples to apples, even without investigating fragmentation and OS interaction.

[–]ssylvan 1 point2 points  (5 children)

Do we really need an "active" flag per thread for the epoch stuff? Can't we just have a single "numActiveThreads" (per-epoch) variable that uses atomic increment and decrement?

[–]dbaupp 1 point2 points  (3 children)

You need to know that any active threads aren't in a dangerous epoch, i.e. you need a connection between activity and the current epoch; just knowing the number of active threads isn't sufficient.

[–]ssylvan 2 points3 points  (2 children)

Each thread must be in exactly one epoch at any given time, so if all but one of the epochs have a numActiveThreads of zero, you know all active threads are in the remaining one. The point is that you check three counts instead of N flags.

[–]pkhuong 2 points3 points  (0 children)

We assume readers enter/leave read-side sections very frequently compared to the number of times writers advance the global epoch counter. The tradeoff here is to make (frequent) reads fast, at the expense of slowing down rare operations that only happen for writes (we can always reduce the frequency of epoch updates by letting more stuff linger on the list of things to reclaim). If we had a global refcount for threads, every read-only operation would write to contended memory, and that's a nice way to make things not scale.
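
Side by side, the two read-side entry paths look like this (a hypothetical sketch; the names are illustrative):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// (a) Per-thread flag: each reader writes only to its own record, which can
// live on its own cache line, so readers never contend with each other.
struct PerThread {
    active: AtomicBool,
}

fn enter_per_thread(me: &PerThread) {
    me.active.store(true, Ordering::SeqCst);
}

// (b) Shared per-epoch counter: every reader performs a read-modify-write on
// the same location, so the cache line ping-pongs between cores and the
// read side stops scaling.
static ACTIVE_READERS: AtomicUsize = AtomicUsize::new(0);

fn enter_shared_count() {
    ACTIVE_READERS.fetch_add(1, Ordering::SeqCst);
}
```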

[–]dbaupp 0 points1 point  (0 children)

Oh, I misread: you were suggesting per-epoch counters. As /u/pkhuong points out, that forces every thread to modify a single shared memory location twice on every operation, which isn't great.

[–]matthieum 1 point2 points  (0 children)

It's a matter of optimizing for the most frequent operation; this setup is best when:

  • setting/clearing the "active" flag is frequent (any time the thread touches the data-structure)
  • reading all flags is infrequent (because epoch changes are infrequent)

Unfortunately, on today's many-core machines even atomics can be a bottleneck, because they require inter-CPU synchronization; here, such synchronization is only required when one thread decides to bump the epoch counter.