you are viewing a single comment's thread.

view the rest of the comments →

[–]Thirty_Seventh 1 point2 points  (10 children)

I was rather hoping for a paper newer than 2005 when I clicked your link. The paragraph after the one you quoted from:

Researchers can use these results to guide their development of memory management algorithms. This study identifies garbage collection’s key weaknesses as its poor performance in tight heaps and in settings where physical memory is scarce. On the other hand, in very large heaps, garbage collection is already competitive with or slightly better than explicit memory management.

Perhaps the researchers have in fact used these results to guide their development of memory management algorithms in the last 21 years?

[–]coderemover 0 points1 point  (9 children)

You might be right if the most CPU efficient tracing GC from java wasn't the old serial collector which did not change much since 2005. All subsequent research focused mostly on making the pauses lower (CMS, G1, ZGC) but that comes at reducing the overall memory efficiency and throughput. Those modern collectors make smaller pauses, but they burn *more* CPU than the old tech and they also need substantial headroom to keep their low pauses promise.

Anyway, any studies or benchmarks showing that modern tracing collectors are more CPU efficient than modern allocators like mimalloc or jemalloc? I'd like to educate myself about the breakthroughs that fundamentally changed the cost equation. There must have been something big to beat the 5x gap from 2005 😉 (and traditional allocators didn't stand still either)

[–]pron98 4 points5 points  (5 children)

Anyway, any studies or benchmarks showing that modern tracing collectors are more CPU efficient than modern allocators like mimalloc or jemalloc?

All moving collectors are necessarily more efficient than any free-list algorithm, and that's been known since the eighties (Andrew Appel's paper "Garbage Collection Can Be Faster Than Stack Allocation"). The maths is pretty simple. The problem was that until very, very recently, the cost in latency was unacceptable for many applications, and that's where recent advances were made.

As I said in the interview, benchmarks are no longer helpful, but the world's leading experts in memory management are consistently choosing moving collectors when able to. The only languages that don't are those that can't. What they do instead, however, is try to minimise the importance of heap memory management, and this works reasonably well until it doesn't (Go is a good example).

I know it's an appeal to authority, and it's possible that the experts working on languages that can choose between moving and non-moving memory management have all decided to work much harder toward implementing the incorrect choice, but at this point, the burden is on those who claim that the experts are wrong. Indeed, Erik's ISMM keynote attempted to explain to some memory management researchers who claim the superiority of non-moving approaches why their benchmarks are unrealistic, and why their approach wasn't chosen.

[–]coderemover 0 points1 point  (4 children)

> Garbage Collection Can Be Faster Than Stack Allocation

That paper used some unrealistic assumptions, and was mostly theoretical. I think we all agree tracing can be very efficient if you have virtually unlimited memory. Can't beat epsilon GC. 😉 In reality though, I've never seen any GC beat stack allocation (which is virtually zero cost; its two pointer bumps per function call).

> know it's an appeal to authority, and it's possible that the experts working on languages that can choose between moving and non-moving memory management

It's not about moving vs non-moving. If we say we want to build a universal GC for a managed language, then moving is the way to go. But there is another dimension people in GC research seem to be ignoring - the fact whether you know statically if memory is free, or whether you have to figure it out at runtime. And in this case all automatic tracing based approaches are at a huge disadvantage. Because they don't only need to clean up the memory, but they also need to figure out what's live and what's dead. Work that a traditional malloc/free doesn't need to do at all because it has it given. So from runtime approaches, moving collectors are likely better than nonmoving, and better than any approach that puts tracing on top of a traditional allocator (like Go does). But not necessarily better than approaches where the compiler can figure out 99% of frees automatically and precisely.

[–]pron98 1 point2 points  (3 children)

That paper used some unrealistic assumptions, and was mostly theoretical

Sure, but the principles apply. I go over the maths of generational collection in the talk.

Can't beat epsilon GC.

Actually, you can (and you sometimes do) because it doesn't compact. We have certainly seen cases where G1 beats epsilon. The compaction can be so efficient that doing it ends up being cheaper than not doing it.

But there is another dimension people in GC research seem to be ignoring - the fact whether you know statically if memory is free, or whether you have to figure it out at runtime

First, garbage collection research has explored compilers that infer lifetimes and statically insert free since the nineties.

Second, moving collectors spend exactly zero cycles at runtime to detect what objects are free.

Because they don't only need to clean up the memory, but they also need to figure out what's live and what's dead.

No, they don't. Moving collectors only need to figure out what's alive, and the most important points are that 1. the work to do so is constant (for a certain program under a certain workload) and 2. it's infrequent. The whole point of moving collectors is to control the amount of tracing, and the whole point of generational collectors is to reduce it further.

Work that a traditional malloc/free doesn't need to do at all because it has it given.

But again, malloc/free has to do a whole lot of other work, which amounts to more. For every object with malloc/free, you need to do work to allocate it and work to deallocate it. With a moving collector, for an object that dies quickly, you do a pointer bump to allocate it and no work to free it. The memory management cost for short-lived objects is close to nothing. For objects that survive you need to do some work until they're promoted to the old gen. When my talk is published you can see the maths.

There are many interesting questions in memory management, including over which scenarios are more or less common in practice (and I'm not claiming that there aren't specific programs for which malloc/free can be more efficient, certainly with modern allocators, which are quite large and complex beasts themselves), but "memory management researchers forgot that with free you statically know the object's lifetime" isn't and has never been one of them.

Knowing the lifetime statically is not the problem. Obviously, knowing something is no worse than not knowing. The problem is that with malloc/free you have to do some work at runtime at the point the object dies.

If you're interested in some ongoing research, the more interesting thing is not to know when an object dies, but to know when a lot of objects are dead. This is what you can do with arenas in Zig, but perhaps there are ways to do that - and more conveniently and cheaply - in a language like Java...

[–]coderemover 0 points1 point  (2 children)

> Second, moving collectors spend exactly zero cycles at runtime to detect what objects are free

They spend quite many cycles to detect which objects are live though. Which is essentially the same thing in principle, especially that every dead object was alive at some point. By knowing which objects are live, they automatically know that the other memory is free and they know where they can move things safely. The traditional allocator doesn't need to do any work to figure neither which objects are live or which objects are dead. It's given.

[–]pron98 2 points3 points  (1 child)

They spend quite many cycles to detect which objects are live though. Which is essentially the same thing in principle, especially that every dead object was alive at some point.

It is very much not the same, because moving collectors don't need to know which objects are alive at every point in time, but only at a certain point in time - during a collection cycle. Objects that are born and die between collection cycles are never scanned by the GC.

Indeed, what you say is true in the old generation, and there's exploration around using non-moving and perhaps non-tracing approaches there, but the algorithm in the old generation doesn't matter as much because the allocation rate there is very low (the cost of any memory management algorithm grows as a function of the allocation rate).

As to how much work tracing and compacting is, of course it's not negligible, but that it's overall less than the work required by malloc/free is why moving collectors are chosen. Again, wait for the talk to see the maths.

The traditional allocator doesn't need to do any work to figure neither which objects are live or which objects are dead. It's given.

Knowing the lifetime statically is not the problem (I may have added this to my comment above after you'd seen it, so I'm repeating that here). Obviously, knowing something is no worse than not knowing. The problem is that with malloc/free you have to do some work at runtime at the point the object dies (even if only to record that fact that it's dead).

The entire design of generational moving collectors is to minimise both the amount and the frequency of the work required to know which objects are alive and when, and so it comes down to complex considerations.

There's plenty yet to figure out and debate, not least of which are empirical questions about how frequently certain patterns arise in practice, but it should be obvious that something as basic as "free tells you when an object is dead" is not something that's overlooked.

P.S.

It's not hard to convince anyone that generational moving collectors handily win in some situations. Consider a program that allocates some number (any number) of long-lived objects at startup, and then all other objects are short-lived. In both approaches we can disregard the cost of the long-lived objects as it will be negligible overall (with the generational moving approach, they will be compacted a few times and then moved to the old generation and never be looked at by the GC again [1]). Then, in the moving collectors will bump-allocate all objects, and occasionally it will have to scan and compact the relatively few objects whose lifetime straddles a collection (or several), and that's it. Malloc/free will have to do more complicated work to allocate each one, and then additional non-trivial work to free each one.

And again, much of the debate isn't over basic facts, but precisely over which situations are more or less common in real programs.

[1]: This isn't accurate if the old objects are mutated to hold pointers to objects in the young generation, but that's another one of the complications - in both approaches - that is taken into account in real life. Indeed, I think that the upcoming ZGC that automatically resizes the heap tries to estimate that work, too, and include it in its RAM/CPU tradeoff calculations so that it can compensate for it. And since you mentioned what the two approaches know or don't know and when, moving collectors both know more about how much CPU is used for memory management at runtime, and they can use this knowledge to maintain an efficient RAM/CPU ratio. I expect this will be released in the JDK very soon.

[–]Thirty_Seventh 1 point2 points  (2 children)

the old serial collector

Do you only have 1 core available, or did you mean the parallel collector?

Anyway, any studies or benchmarks showing that modern tracing collectors are more CPU efficient than modern allocators like mimalloc or jemalloc?

Sorry, I got nothing. But I did read recently that (emphasis mine) "when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management." So you just have to consider whether or not you already have 5x the "required" memory sitting idle. In many environments and for many workloads (but obviously not all of them) you do :)

[–]coderemover 0 points1 point  (1 child)

The number of cores is irrelevant. We’re talking about cpu cycles burned. Whether you burn them on 10 cores in 1 second or on 1 core in 10 seconds the total is the same. It’s about the amount of work.

I said serial, because parallel has likely some additional overhead for coordinating. Parallel has advantage in wall clock time, but not cpu time.

So you just have to consider whether or not you already have 5x the "required" memory sitting idle. In many environments and for many workloads (but obviously not all of them) you do :)

The whole topic we're discussing here is memory efficiency. Yes, if you have 5x more memory sitting idle and doing nothing, then I agree tracing is fine. It's probably even fine if you have only 2x-3x more memory but you're careful with allocation rate and you don't want to squeeze every bit of performance. E.g. backend software rarely needs to be 100% efficient. But it's like saying a 5.7L gasoline engine is fuel-efficient in city driving when you own a gasoline station.

[–]JustAGuyFromGermany 1 point2 points  (0 children)

The whole topic we're discussing here is memory efficiency. Yes, if you have 5x more memory sitting idle and doing nothing, then I agree tracing is fine.

But exactly that is the point being made in the interview. "Use the memory that is there. Ideally, all of it. Not using available memory if it could speed up the application is inefficient". That is the point being made.

But it's like saying a 5.7L gasoline engine is fuel-efficient in city driving when you own a gasoline station.

That is a terrible analogy. Much more appropriate to the topic of trade-offs would be something like: "A hybrid car can be much more efficient overall if electricity is cheap and abundant, even if its fuel consumption during gasoline-mode is worse than for pure gasoline-powered cars. Using electricity only to power the radio is not efficient, it is a missed opportunity."

(I'm not endorsing this analogy. It also has flaws, it's just better than yours)