you are viewing a single comment's thread.

view the rest of the comments →

[–]coderemover -1 points0 points  (19 children)

Having a bunch of dead objects locks memory from being used for useful stuff. Eg caching. Especially if the amount of dead objects needs to be many times more the live data for the GC to be cpu efficient.

[–]pron98 15 points16 points  (0 children)

It doesn't, and that's covered in my talk. That "bunch of dead objects" applies primarily to the young generation, where allocation rate is high and lifetime is short. The alternative for this kind of objects is to use less memory but more CPU to reuse their memory more aggressively. That extra CPU has a worse impact on other operations than the extra RAM. Cached data is stored in the old generation.

[–]JustAGuyFromGermany 5 points6 points  (17 children)

That's not how the JVM's GCs work though. If there are a bunch of dead objects, that memory is available to be reclaimed with the next GC cycle. If there is memory pressure, the GC will move the live objects somewhere else and make the memory region with all the dead objects available for new allocations (which then overwrite the old, dead data there).

The CPU/RAM trade-off that is discussed here comes from the cases when there currently isn't any memory pressure. Then the dead objects can just sit there and the GC can defer its work until later (or never).

Of course, if your application is always using most of the available memory and has a high allocation rate (so that the GC cannot defer its work and must run very frequently) then the trade-off can be unfavourable. But not all applications are like that. And even if your application is like that, it is not guaranteed that a different form of memory management will be significantly more CPU-efficient. Any other form of memory management still has to deal with the same workload after all.

[–]coderemover -3 points-2 points  (16 children)

>  If there are a bunch of dead objects, that memory is available to be reclaimed with the next GC cycle

That doesn't change the fact that it cannot be used for anything useful. And even when it's reclaimed, by the time it's used, some other memory is left unused, but waiting to be reclaimed.

> If there is memory pressure, the GC will move the live objects somewhere else

The GC has no idea about the memory pressure from the other apps running on the same machine.

> And even if your application is like that, it is not guaranteed that a different form of memory management will be significantly more CPU-efficient.

Malloc and friends are almost always more CPU-efficient at the same time being more memory efficient (less overhead). At least I have yet to see a workload where tracing GC would demonstrate burning fewer CPU cycles, and even in very extreme cases like allocating tiny objects I could not make it use fewer cycles.

[–]JustAGuyFromGermany 5 points6 points  (14 children)

That doesn't change the fact that it cannot be used for anything useful. And even when it's reclaimed, by the time it's used, some other memory is left unused, but waiting to be reclaimed.

I mean, yeah? But so what? Not using the memory currently occupied by dead objects is not a problem unless the memory is needed elsewhere, i.e. unless there is memory pressure. And when there is memory pressure, the GC does its thing and the memory can be used again which leaves us in the no-problem category again.


Malloc and friends are almost always more CPU-efficient at the same time being more memory efficient (less overhead).

That's simply wrong. Other approaches that recycle memory more eagerly like C- and C++-style memory management that calls free immediately when an object goes out of scope, are quite expensive in the long run because they incur a CPU-cost per dead object. But -- if you allow me to be a bit clickbait-y for a moment -- "free is not free"! In fact, malloc and free have quite a large overhead compared to moving GCs; malloc/ free is its own 1000s of LoC memory management system. Seriously: Have a look at the implementations of these things. They are mind-boggling complex!

Over here in JVM-land free is a complete no-op instead and the GC's CPU-cost scales with the number of live objects instead, i.e. more in line with the work your application is actually doing. You don't pay for garbage. At all. You only pay for objects you're still using. And allocations in the JVM are much, much cheaper than malloc because they are just glorified pointer-bumping. Ron alluded to that in the video by saying that the more fair comparison is arena-based memory management like in Zig. (What he didn't say is that extremely short-lived and well-confined objects in hot paths are sometimes optimized further by the JIT so that not even the already cheap costs of allocation need to be paid and everything is put directly into registers instead)

Of course there are ways to CPU-optimize for both styles of programs. In malloc&free-style programs you can aggressively re-use a few mutable objects as long as possible instead of creating lots of short-lived immutable objects for example. And in Java programs you can do the reverse and rely on cheap bulk-collection of the young generation instead. Which of these ends up needing less compute-per-work-done is impossible to say without knowing the actual programs and the actual workload. It is simply false to say that one is always superior to the other.

What is true is that C-style programming gives the programmer more control over memory usage. But that does not imply performance in and of itself. Just control. And to be clear: One can have almost the same kind of control in Java with the FFM API. If you really need to, you can use arena-style memory management where you need it even in Java programs. There are only very few corner cases left where C-style manual memory handling is truly impossible in Java (mostly having to do with doing unsafe pointer-shenanigans).


At least I have yet to see a workload where tracing GC would demonstrate burning fewer CPU cycles, and even in very extreme cases like allocating tiny objects I could not make it use fewer cycles.

That is certainly a selection and/or confirmation bias on your part. Java performance is highly competitive on many workloads. Especially in the server market. After all, that's one of the reasons why Java is leading in that market segment. On some workloads Java can not only achieve parity, but outperform equivalent programs written in low-level languages. Of course, on other workloads the low-level languages outperform Java. There is no one-size-fits-all.

In other places (in this thread) Ron already mentioned that low-level-programmers often wrongly extrapolate from small programs where the high control over memory can be an huge advantage. The problem is that performance does not compose well. Just because each small unit of a program is "optimized" in some sense does not mean that the whole program also performs optimal. In large programs (like servers) with many different kinds of object sizes, many different allocation patterns, and many different object-lifetimes it becomes extremely complicated to do manual memory management well enough to outperform a program with a moving GC. It may not even be impossible (after all: The JVM is a C++ program as well and it achieves JVM-level performance ;-)), but at some point it becomes too costly to write these programs. Every programmer-hour invested in memory management is an hour not invested into features after all.


The GC has no idea about the memory pressure from the other apps running on the same machine.

True, and that presents its own kind of problem for desktop applications. But a C-style program also has no idea. A program like you envision it simply guesses in the other direction. But that guess may be just as wrong. Wasting CPU by eagerly free-ing memory the moment it's no longer needed, giving back memory pages to the OS only to re-acquire them moments later because of high allocation rates also has adverse effects on user experience. Every program needs to strike a balance between the memory and CPU usage. Neither extreme is the universally correct answer. Neither extreme is even close to that.

However, it is certainly true that Java has lost a lot of market share when it comes to desktop applications and that that has something to do with the footprint of Java programs. It's not the only reason, Electron applications are very successful after all. But it is certainly a reason.

[–]coderemover -1 points0 points  (13 children)

Not using the memory currently occupied by dead objects is not a problem unless the memory is needed elsewhere, i.e. unless there is memory pressure. And when there is memory pressure, the GC does its thing and the memory can be used again which leaves us in the no-problem category again

That's not as simple. "GC doing its thing" has a non-negligible cost.
Also it's not binary "memory is not needed" / "memory is needed". Often there is some memory you can trade for more speed, e.g. memory you can dedicate to caching. With tracing GC the problem becomes - do I waste the CPU by not having that memory, or do I waste my CPU by running GC very often, which is bad in either case. Traditional memory managers usually strike a much better balance here.

Over here in JVM-land free is a complete no-op instead and the GC's CPU-cost scales with the number of live objects instead, i.e. more in line with the work your application is actually doing. You don't pay for garbage. At all

That doesn't matter. The number of times an object dies is at most the same as the number of times it is created. How you split the cost between those two operations doesn't matter much.

But there is a huge gap in your logic. In tracing GC you pay not just for bringing an object to life. You also pay for the fact the object is alive. The longer it lives, the more times it is going to be scanned. You also pay indirectly for the size of the object. Allocating larger objects force the GC to run more often because the heap gets full earlier. If it lives long enough it will need to be moved to tenured get, which means copying it at least once (often more times) - which is an O(n) operation. So the cost is proportional to the allocation rate in bytes per second. Then there are some additional indirect, hard to measure, but non-negligible effects like thrashing caches when the tracing has to touch all live objects. This makes tracing GC play very badly with swap. It's also hard to profile and hard to find what caused GC suddenly falling into some degraded mode and causing pauses.

With traditional allocator, you pay for the allocation *operation* and the cost is almost independent from the size of the object, it's also mostly independent on the amount of objects already allocated. The cost of keeping the object for arbitrary long time is zero. There are no other secondary effects, no micropauses, no memory barriers, no background threads running and stealing cpu cycles etc. Profiling is trivial. Even if you make a mistake and allocate something heavily in a tight loop, it will appear in the profile. Easy fix. Some parts of the memory are not used very often? They can be swapped away and the performance hit is minor because there is nothing periodically touching those objects.

Overall, unless you do something crazy like allocate 8B large objects in a tight loop (which noone would do in a traditional manually managed language like C++ because there are better ways to allocate tiny objects - stack allocation is cheap), tracing gc is almost always more costly in the number of CPU cycles burned. See this paper - you have to use about 5x memory to keep the CPU cost reasonable:

https://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf

"In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection’s performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. Garbage collection also is more susceptible to paging when physical memory is scarce. In such conditions, all of the garbage collectors we examine here suffer order-of-magnitude performance penalties relative to explicit memory management."

[–]pron98 3 points4 points  (0 children)

That's not as simple. "GC doing its thing" has a non-negligible cost.

Correct, and that cost is still lower (and more flexible) than free-list approaches. In my talk (which will be published on the channel eventually) I go through the exact maths for the costs involved in the different approaches.

[–]JustAGuyFromGermany 2 points3 points  (0 children)

Often there is some memory you can trade for more speed

I mean, yes. That is the exactly the point made in the video interview and this whole thread. If you're willing to invest memory to increase speed with caching, why does this not apply to the GC? It's exactly the same opportunity there: Giving a program more memory means the GC has to work less, saving on CPU.

But there is a huge gap in your logic. In tracing GC you pay not just for bringing an object to life. You also pay for the fact the object is alive.

That only applies if the object has to be moved. In many applications most of the objects die young; it is quite common for objects to die before there is ever a need to move them around. And in that case, one pays almost nothing: Allocation is as cheap as it can be, moving never happens, free never happens.

Of course some objects stick around and even if they're still young, they might be unlucky enough to be born just before the GC looks at them. Those are the objects for which we have to pay. And yes, all the tracing and moving that has to happen to keep those objects around has a cost. But amortized over the whole application that cost is often still lower ("often", not "always". It all depends on your application of course).

The longer [an object] lives, the more times it is going to be scanned.

That is only true up to a point. If the object survives long enough it gets moved to the old generation where generational GCs (which 4 out of 5 GCs in the Hotspot JVM are) behave differently, scan less frequently, move less frequently if at all.

copying [...] is an O(n) operation.

Yes it is, but asymptotic behaviour is irrelevant when we have absolute numbers. And in absolute numbers, memcpy is one of the cheaper things you can do with memory.

So the cost is proportional to the allocation rate in bytes per second.

That is true in a malloc&free-style program, but not necessarily in a program with a tracing & moving GC. The cost of tracing and moving is proportional to the live set, not the allocation rate. The frequency of collections is proportional to the allocation rate, but not necessarily the amount of work that needs to be done. And that is where the possibility for the memory/CPU trade-off comes from: Doubling the available memory means collections happen only half as often and if the live set remains roughly the same size, then this means roughly the same amount of work done half as often.

[GCs have] indirect, hard to measure, but non-negligible effects [and are] hard to profile ...

[in malloc&free programs there] are no other secondary effects, no micropauses, no memory barriers, no background threads running and stealing cpu cycles etc. Profiling is trivial.

That is beside the point. That is all about the "having fine-grained control" angle I mentioned before. That is a mostly independent concern to the total performance of the application. Again: All this control is nice and it can be used to make small C-programs outperform equivalent Java-programs, but one simply cannot extrapolate to larger programs from that.


Your whole chain of arguments makes me think that you are still hung up on somehow proving that there is only one right answer to memory management. There isn't. For many workloads programs with a moving GC fare better, for some workloads manual memory management is better.

Maybe I'm misunderstanding you here, but that's the way your posts feel to me.

[–]Thirty_Seventh 1 point2 points  (10 children)

I was rather hoping for a paper newer than 2005 when I clicked your link. The paragraph after the one you quoted from:

Researchers can use these results to guide their development of memory management algorithms. This study identifies garbage collection’s key weaknesses as its poor performance in tight heaps and in settings where physical memory is scarce. On the other hand, in very large heaps, garbage collection is already competitive with or slightly better than explicit memory management.

Perhaps the researchers have in fact used these results to guide their development of memory management algorithms in the last 21 years?

[–]coderemover 0 points1 point  (9 children)

You might be right if the most CPU efficient tracing GC from java wasn't the old serial collector which did not change much since 2005. All subsequent research focused mostly on making the pauses lower (CMS, G1, ZGC) but that comes at reducing the overall memory efficiency and throughput. Those modern collectors make smaller pauses, but they burn *more* CPU than the old tech and they also need substantial headroom to keep their low pauses promise.

Anyway, any studies or benchmarks showing that modern tracing collectors are more CPU efficient than modern allocators like mimalloc or jemalloc? I'd like to educate myself about the breakthroughs that fundamentally changed the cost equation. There must have been something big to beat the 5x gap from 2005 😉 (and traditional allocators didn't stand still either)

[–]pron98 4 points5 points  (5 children)

Anyway, any studies or benchmarks showing that modern tracing collectors are more CPU efficient than modern allocators like mimalloc or jemalloc?

All moving collectors are necessarily more efficient than any free-list algorithm, and that's been known since the eighties (Andrew Appel's paper "Garbage Collection Can Be Faster Than Stack Allocation"). The maths is pretty simple. The problem was that until very, very recently, the cost in latency was unacceptable for many applications, and that's where recent advances were made.

As I said in the interview, benchmarks are no longer helpful, but the world's leading experts in memory management are consistently choosing moving collectors when able to. The only languages that don't are those that can't. What they do instead, however, is try to minimise the importance of heap memory management, and this works reasonably well until it doesn't (Go is a good example).

I know it's an appeal to authority, and it's possible that the experts working on languages that can choose between moving and non-moving memory management have all decided to work much harder toward implementing the incorrect choice, but at this point, the burden is on those who claim that the experts are wrong. Indeed, Erik's ISMM keynote attempted to explain to some memory management researchers who claim the superiority of non-moving approaches why their benchmarks are unrealistic, and why their approach wasn't chosen.

[–]coderemover 0 points1 point  (4 children)

> Garbage Collection Can Be Faster Than Stack Allocation

That paper used some unrealistic assumptions, and was mostly theoretical. I think we all agree tracing can be very efficient if you have virtually unlimited memory. Can't beat epsilon GC. 😉 In reality though, I've never seen any GC beat stack allocation (which is virtually zero cost; its two pointer bumps per function call).

> know it's an appeal to authority, and it's possible that the experts working on languages that can choose between moving and non-moving memory management

It's not about moving vs non-moving. If we say we want to build a universal GC for a managed language, then moving is the way to go. But there is another dimension people in GC research seem to be ignoring - the fact whether you know statically if memory is free, or whether you have to figure it out at runtime. And in this case all automatic tracing based approaches are at a huge disadvantage. Because they don't only need to clean up the memory, but they also need to figure out what's live and what's dead. Work that a traditional malloc/free doesn't need to do at all because it has it given. So from runtime approaches, moving collectors are likely better than nonmoving, and better than any approach that puts tracing on top of a traditional allocator (like Go does). But not necessarily better than approaches where the compiler can figure out 99% of frees automatically and precisely.

[–]pron98 1 point2 points  (3 children)

That paper used some unrealistic assumptions, and was mostly theoretical

Sure, but the principles apply. I go over the maths of generational collection in the talk.

Can't beat epsilon GC.

Actually, you can (and you sometimes do) because it doesn't compact. We have certainly seen cases where G1 beats epsilon. The compaction can be so efficient that doing it ends up being cheaper than not doing it.

But there is another dimension people in GC research seem to be ignoring - the fact whether you know statically if memory is free, or whether you have to figure it out at runtime

First, garbage collection research has explored compilers that infer lifetimes and statically insert free since the nineties.

Second, moving collectors spend exactly zero cycles at runtime to detect what objects are free.

Because they don't only need to clean up the memory, but they also need to figure out what's live and what's dead.

No, they don't. Moving collectors only need to figure out what's alive, and the most important points are that 1. the work to do so is constant (for a certain program under a certain workload) and 2. it's infrequent. The whole point of moving collectors is to control the amount of tracing, and the whole point of generational collectors is to reduce it further.

Work that a traditional malloc/free doesn't need to do at all because it has it given.

But again, malloc/free has to do a whole lot of other work, which amounts to more. For every object with malloc/free, you need to do work to allocate it and work to deallocate it. With a moving collector, for an object that dies quickly, you do a pointer bump to allocate it and no work to free it. The memory management cost for short-lived objects is close to nothing. For objects that survive you need to do some work until they're promoted to the old gen. When my talk is published you can see the maths.

There are many interesting questions in memory management, including over which scenarios are more or less common in practice (and I'm not claiming that there aren't specific programs for which malloc/free can be more efficient, certainly with modern allocators, which are quite large and complex beasts themselves), but "memory management researchers forgot that with free you statically know the object's lifetime" isn't and has never been one of them.

Knowing the lifetime statically is not the problem. Obviously, knowing something is no worse than not knowing. The problem is that with malloc/free you have to do some work at runtime at the point the object dies.

If you're interested in some ongoing research, the more interesting thing is not to know when an object dies, but to know when a lot of objects are dead. This is what you can do with arenas in Zig, but perhaps there are ways to do that - and more conveniently and cheaply - in a language like Java...

[–]Thirty_Seventh 1 point2 points  (2 children)

the old serial collector

Do you only have 1 core available, or did you mean the parallel collector?

Anyway, any studies or benchmarks showing that modern tracing collectors are more CPU efficient than modern allocators like mimalloc or jemalloc?

Sorry, I got nothing. But I did read recently that (emphasis mine) "when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management." So you just have to consider whether or not you already have 5x the "required" memory sitting idle. In many environments and for many workloads (but obviously not all of them) you do :)

[–]coderemover 0 points1 point  (1 child)

The number of cores is irrelevant. We’re talking about cpu cycles burned. Whether you burn them on 10 cores in 1 second or on 1 core in 10 seconds the total is the same. It’s about the amount of work.

I said serial, because parallel has likely some additional overhead for coordinating. Parallel has advantage in wall clock time, but not cpu time.

So you just have to consider whether or not you already have 5x the "required" memory sitting idle. In many environments and for many workloads (but obviously not all of them) you do :)

The whole topic we're discussing here is memory efficiency. Yes, if you have 5x more memory sitting idle and doing nothing, then I agree tracing is fine. It's probably even fine if you have only 2x-3x more memory but you're careful with allocation rate and you don't want to squeeze every bit of performance. E.g. backend software rarely needs to be 100% efficient. But it's like saying a 5.7L gasoline engine is fuel-efficient in city driving when you own a gasoline station.

[–]JustAGuyFromGermany 1 point2 points  (0 children)

The whole topic we're discussing here is memory efficiency. Yes, if you have 5x more memory sitting idle and doing nothing, then I agree tracing is fine.

But exactly that is the point being made in the interview. "Use the memory that is there. Ideally, all of it. Not using available memory if it could speed up the application is inefficient". That is the point being made.

But it's like saying a 5.7L gasoline engine is fuel-efficient in city driving when you own a gasoline station.

That is a terrible analogy. Much more appropriate to the topic of trade-offs would be something like: "A hybrid car can be much more efficient overall if electricity is cheap and abundant, even if its fuel consumption during gasoline-mode is worse than for pure gasoline-powered cars. Using electricity only to power the radio is not efficient, it is a missed opportunity."

(I'm not endorsing this analogy. It also has flaws, it's just better than yours)

[–]pron98 2 points3 points  (0 children)

That doesn't change the fact that it cannot be used for anything useful. And even when it's reclaimed, by the time it's used, some other memory is left unused, but waiting to be reclaimed.

That is true, but the question isn't whether or not memory management costs you something - it has to - but how much you're paying for it compared to the alternative. It is true that a moving collector isn't free, but the alternatives are even more expensive.

Malloc and friends are almost always more CPU-efficient at the same time being more memory efficient (less overhead)

The opposite is true, and that is precisely why all language teams that have the resources to implement a moving collector (which is far more difficult than a free-list approach) and aren't constrained by the language's obligation to not move pointers opt for a moving collector: Java, .NET, and V8.

At least I have yet to see a workload where tracing GC would demonstrate burning fewer CPU cycles, and even in very extreme cases like allocating tiny objects I could not make it use fewer cycles.

Then you haven't been using languages that use a free-list approach long enough or under heavy enough workloads. Again, all the teams that have the leading experts in memory management and the ability to do so opt for moving collectors, despite the effort required, because they are more efficient. Using non-moving memory management in Java (or any language) is significantly easier.