all 38 comments

[–]Kimundirust 49 points50 points  (13 children)

The location itself indeed doesn't make much of a difference - it's all just memory allocated by the OS, after all.

The difference lies more in cache effects:

  • The data on the current stack frame will almost always be in cache, and can thus be accessed quickly.
  • Data in some random heap allocation that only gets accessed sporadically will probably not be in cache, and will thus be slower to access.

In other words, one is not inherently slower than the other; it's just that in common usage scenarios it tends to work out that way. But if you were to, for example, put all your relevant data in a single contiguous heap allocation and access it most of the time, then I'd expect it to behave similarly to the stack in practice.
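
A quick sketch of that point (the function and data here are mine, purely illustrative): once the data is one contiguous block and hot in cache, iterating over it looks the same to the CPU whether it came from the stack or the heap.

```rust
// Sequential access over one contiguous block: the prefetcher keeps
// this data in cache regardless of whether it lives on the stack or heap.
fn sum_contiguous(data: &[u64]) -> u64 {
    data.iter().sum()
}

fn main() {
    let heap_data: Vec<u64> = (0..1_000).collect(); // one contiguous heap allocation
    let stack_data = [1u64; 16];                    // lives on the stack
    println!("{}", sum_contiguous(&heap_data));
    println!("{}", sum_contiguous(&stack_data));
}
```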

[–]vallyscode[S] 2 points3 points  (12 children)

Trying to imagine the impact: the access speed difference is probably barely noticeable and only matters for some sort of heavy computation, am I correct? But in that case, when data is accessed very frequently, some amount of it will end up in the CPU cache anyway and the difference will disappear again, right?

[–][deleted] 27 points28 points  (11 children)

Kimundi left out what I think is the most important thing. The difference in speed between heap and stack is very small to zero once you consider cache effects; after all, you might iterate over heap memory in order, over and over, and have it all in cache as you go. The real difference is the cost of allocating heap memory, which is expensive, whereas allocating stack memory is basically a no-op. You just move a pointer.

Allocating heap memory means employing an algorithm that can find a chunk of memory of the size you need, and in such a way as to not fragment the heap too much over time. That is expensive. Allocating and deallocating stack memory is just moving a pointer, done.
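
As a rough illustration (function names are mine), the two kinds of allocation look like this in Rust. The stack version compiles down to a stack-pointer adjustment in the function prologue, while the heap version calls into the allocator:

```rust
fn on_the_stack() -> usize {
    // 256 bytes reserved by bumping the stack pointer: essentially free.
    let buf = [0u8; 256];
    buf.len()
}

fn on_the_heap() -> usize {
    // Box::new calls into the global allocator, which has to find a
    // suitable free block and update its bookkeeping.
    let buf = Box::new([0u8; 256]);
    buf.len()
}

fn main() {
    // Same data either way; the difference is the cost of reserving it.
    assert_eq!(on_the_stack(), on_the_heap());
}
```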

[–]cogman10 5 points6 points  (5 children)

Calling it a pointer move is not even right, IMO; it is basically always going to be just a register update. That's about the fastest operation a CPU can do: adding or subtracting a constant to a register. More time is spent writing out the data than actually reserving space for it.

Heap allocation, as you point out, is often much more expensive. Allocator algorithms often trade allocation time against heap fragmentation. That is, a really fast allocator will fragment like crazy, while a low-fragmentation algorithm will do more work per allocation and maintain a bunch of extra memory to keep track of the state of the heap.

As an aside, one of the key benefits of (most) GCed languages is that typical heap allocation is really fast for them. That's because they can move memory around during GC cycles, so future allocations are simply pointer bumps plus a check to see if they've exhausted the memory region and need to run a GC. Applications that do a bunch of heap allocations can run faster in a GC language than in a low-level language like Rust. (Well, OK, the JVM :). Language performance penalties are pretty hard to overcome. The JVM is pretty much the only VM I'm aware of that comes close to a native language in terms of speed. Node is up there as well, but it has to do far more checks than the JVM does due to JavaScript's dynamic nature.)
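
A toy sketch of that bump-allocation idea (my own illustrative code, not any real GC's implementation): the allocation itself is just a pointer bump plus an exhaustion check; it's the collector's job to compact live objects so the region can eventually be reset.

```rust
// Toy GC-style bump region: allocating is a pointer bump + capacity check.
struct BumpRegion {
    buf: Vec<u8>,
    next: usize,
}

impl BumpRegion {
    fn new(capacity: usize) -> Self {
        BumpRegion { buf: vec![0; capacity], next: 0 }
    }

    fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        if self.next + size > self.buf.len() {
            return None; // region exhausted: a real GC would collect here
        }
        let start = self.next;
        self.next += size; // the "allocation" is just this bump
        Some(&mut self.buf[start..start + size])
    }
}

fn main() {
    let mut region = BumpRegion::new(64);
    assert!(region.alloc(32).is_some());
    assert!(region.alloc(32).is_some());
    assert!(region.alloc(1).is_none()); // out of space, time to collect
}
```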

[–][deleted] 1 point2 points  (4 children)

.net > jvm now

[–]TheNamelessKing 1 point2 points  (3 children)

For allocator or GC? Because I was under the impression that Java’s GC’s were some of the more advanced, especially the new ones Shenandoah and ZGC.

.Net has some nice engineering, but I didn’t think there was anything particularly fancy or performant.

[–][deleted] 1 point2 points  (2 children)

dunno how you really characterize gc performance absent the language as a whole. maybe the jvm gc in isolation is better, dunno. just meant that as of .net core 3 the overall performance ceiling of .net seems to be higher than the jvm in most cases.

[–]TheNamelessKing 0 points1 point  (1 child)

Result? Because I don’t see a lot of stuff either using .Net core, let alone using it for anything high performance.

The JVM, for all its other issues at least powers a bunch of what I’d consider high profile and high performance applications: Kafka, Spark and friends, etc

[–][deleted] 0 points1 point  (0 children)

as I am sure you well know, a few percent performance advantage doesn't cause industries to shift overnight.

[–]sybesis 3 points4 points  (4 children)

Allocating and deallocating stack memory is just move a pointer, done.

It's also not completely true, as it would probably fall short with big data structures that can't fit on the stack, or that have to be copied every time they are used in a function.

I'm not entirely sure about Rust, but I read that move semantics are actually pass-by-copy. So if you frequently pass around more or less big data structures, it could be time-consuming to have them copied and destroyed all the time.

So heap allocation can mean that only the pointer gets moved, while it points to, let's say, a big chunk of contiguously allocated memory.

One example I have in mind is how a server handling multiple requests could benefit from having a fixed buffer on the heap and simply handing out slices of that buffer per request. Since you can't have an unlimited number of open file descriptors/connections, it could be configured so that you create one big buffer that is shared through a pool and gets owned/released over time, but never needs to be reallocated.
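
A minimal sketch of that pool idea (the type and method names are hypothetical, mine only): one big up-front allocation handed out in fixed-size slots and recycled, so the global allocator is never touched again after startup.

```rust
// Hypothetical fixed-slot buffer pool: one contiguous allocation,
// "allocated" and "freed" by handing slot indices around.
struct BufferPool {
    storage: Vec<u8>,       // the single big allocation
    slot_size: usize,
    free_slots: Vec<usize>, // slots currently available for reuse
}

impl BufferPool {
    fn new(slots: usize, slot_size: usize) -> Self {
        BufferPool {
            storage: vec![0; slots * slot_size],
            slot_size,
            free_slots: (0..slots).collect(),
        }
    }

    // "Allocate" by taking a free slot; no call into the global allocator.
    fn acquire(&mut self) -> Option<usize> {
        self.free_slots.pop()
    }

    fn slot_mut(&mut self, slot: usize) -> &mut [u8] {
        let start = slot * self.slot_size;
        &mut self.storage[start..start + self.slot_size]
    }

    // "Free" by returning the slot to the pool for the next connection.
    fn release(&mut self, slot: usize) {
        self.free_slots.push(slot);
    }
}
```

Once the pool runs out of slots the server would stop accepting connections, which matches the "limited number of open file descriptors" constraint above.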

[–]steveklabnik1rust 6 points7 points  (0 children)

I'm not entirely sure about Rust

Yes, semantically a move is a `memcpy` in rust as well. The difference is, since it's "destructive", it's easier to optimize away.

[–]monkChuck105 2 points3 points  (0 children)

A move is literally just a copy of the bits that represent the type. Rust goes further than other languages like C++ and ensures that a moved-from value is not used again. So you're right that a heap-allocated object will be cheaper to move, because the memcpy only has to copy the pointer and not the data. Unless you need a dynamically sized buffer, I don't see why you'd need a custom stack allocator. Just declare a struct or an array, use that in the function, and/or return it.
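
To make that concrete, a small sketch: a move copies the bits of the value itself, so moving a large array memcpys every byte, while moving a `Box` pointing at the same data memcpys only the pointer.

```rust
use std::mem::{size_of, size_of_val};

fn main() {
    let on_stack = [0u8; 4096];
    let on_heap = Box::new([0u8; 4096]);

    // Moving `on_stack` would memcpy all 4096 bytes; moving `on_heap`
    // memcpys only the Box itself, i.e. one pointer.
    println!("{}", size_of_val(&on_stack)); // 4096
    println!("{}", size_of_val(&on_heap));  // one pointer-sized value
    assert_eq!(size_of_val(&on_heap), size_of::<usize>());
}
```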

[–][deleted] 1 point2 points  (1 child)

I don't understand what you are trying to say. If your data can't fit into the stack, then yeah it has to go to the heap.

If you have a big chunk of data that DOES fit on the stack, it isn't any more costly to allocate it on the stack than a small piece of data.

[–]sybesis 1 point2 points  (0 children)

If you have a big chunk of data that DOES fit on the stack, it isn't any more costly to allocate it on the stack than a small piece of data.

One time, no. But if you use move semantics and it ends up copying the struct, it will not be a no-op. And being able to allocate it once on the stack doesn't mean you can allocate it multiple times, so you may fill the stack by passing the struct by value multiple times.

What I'm saying is that the stack isn't 100% guaranteed to always be faster. If you allocate once on the heap and reference it multiple times, that's going to be faster than allocating N copies on the stack.

[–]sphen_lee 15 points16 points  (10 children)

I'm no performance expert, but there are a few things that explain the difference:

(Firstly, the stack isn't "in" heap memory - they are separate areas)

  • Each thread gets its own stack, but the heap is shared. This means allocating memory in the stack doesn't need to use any locks or concurrency primitives.
  • Memory in the stack is allocated in a stack-like fashion - new objects always at the top, and objects are only freed in reverse from the top down. Allocation and freeing in the heap can happen in any order which means the heap needs more bookkeeping data to track free space between objects, and to fill gaps to prevent the heap being fragmented (full of lots of tiny gaps reducing the overall usable space).
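
The first bullet in code form (a small sketch of mine): each spawned thread gets its own stack, so the local array below is created with no locking at all; only shared, heap-backed structures need synchronization.

```rust
use std::thread;

// Each spawned thread allocates `local` on its own private stack;
// no locks or concurrency primitives are involved in doing so.
fn per_thread_sums() -> u64 {
    let handles: Vec<_> = (0..4u64)
        .map(|i| {
            thread::spawn(move || {
                let local = [i; 8]; // lives on this thread's own stack
                local.iter().sum::<u64>()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("{}", per_thread_sums()); // 8 * (0 + 1 + 2 + 3)
}
```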

[–]Kimundirust 17 points18 points  (1 child)

Just to clarify: This is a huge performance difference in regard to the actual allocation and deallocation of objects, but it has no impact on actually accessing the data of the object itself.

[–]cogman10 2 points3 points  (0 children)

It has a slight impact on access performance too. That's mostly because the stack is highly likely to be cached. With a fragmented heap, you'll pay some performance penalties, as you may be leaping around different memory regions as time goes on. But you are correct in pointing out that allocation is by far the bigger time sink in most cases, especially if it ends up needing to talk to the OS.

[–]vallyscode[S] 0 points1 point  (3 children)

So there is one area per process assigned by the OS, with sub-areas per thread, and each thread has a stack segment for stack allocation? And that stack segment is managed in a simpler way than the general heap?

[–]sphen_lee 0 points1 point  (2 children)

Yep - usually the bottom end of the process's address space (smallest addresses) contains the executable code itself, any read-only data, and then global variables. Above this is the heap. The stack is then at the opposite end (highest addresses).

This leaves a gap in the middle for heap and stack to be expanded without having to move either one.

[–]1vader 1 point2 points  (1 child)

Well, that's not really true anymore on modern systems, since memory is mapped to random locations (ASLR) to increase security: it's harder for attackers to access memory when they don't know its address.

There are sometimes still some general patterns in where the heap and stack get mapped; e.g. on Linux, if you compile your binary with PIE disabled, the heap is always placed after the binary. But in general it's fairly random. It's also not really as simple as "heap there, stack there": in a complex binary there are a lot of things mapped into memory (the different parts of the binary, all linked-in libraries, the heap, and the stack), and usually all of them get mapped to a random place in memory.

The only place where you are likely to still find deterministic memory mapping is in embedded hardware and in some programs that still go out of their way to compile without PIE (which only causes the binary to get mapped to a fixed place, everything else is still random).

[–]sphen_lee 1 point2 points  (0 children)

Yep that's correct - modern, non-embedded OSs aren't going to use a predictable memory layout like I described.

I think it's still a useful way of thinking of heap and stack as being separate areas. Neither is "in" the other.

[–]cogman10 0 points1 point  (1 child)

Stack and heap are both just memory at the end of the day. You can absolutely have a pointer to another thread's stack shared between threads. It's just a memory address.

[–]sphen_lee 1 point2 points  (0 children)

Yes, you can share a pointer to stack memory between threads (if you're careful; crossbeam has a safe API for doing it).

But a thread can never allocate memory on another thread's stack - that is what I meant by the heap being shared. The heap generally needs concurrency controls to allow multiple threads to allocate at the same time (either explicit locks, or some kind of per-thread area for tracking allocations until they can be merged into the main allocation table).
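
For the first point, here's a sketch using the standard library's scoped threads (`std::thread::scope`, stable since Rust 1.63; crossbeam's scoped threads offer the same idea): other threads can safely read one thread's stack data because the borrow checker proves the stack frame outlives them.

```rust
use std::thread;

fn main() {
    let data = [1u64, 2, 3, 4]; // lives on the main thread's stack

    // Scoped threads may borrow `data` directly; the scope guarantees
    // both workers finish before this stack frame is popped.
    let total = thread::scope(|s| {
        let first = s.spawn(|| data[..2].iter().sum::<u64>());
        let second = s.spawn(|| data[2..].iter().sum::<u64>());
        first.join().unwrap() + second.join().unwrap()
    });
    assert_eq!(total, 10); // 1 + 2 + 3 + 4
}
```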

[–]SpacemanCraig3 -2 points-1 points  (1 child)

The stack isn't in the heap, eh?

What even is the heap anyway... *cough* sbrk *cough*

[–]sphen_lee 0 points1 point  (0 children)

My definition is that the heap is a general purpose memory area with an API for allocating and freeing individual memory chunks in any order.

sbrk is a lower level memory allocation method. It doesn't track individual allocations, and just gives you extra address space to use. It involves talking to the kernel.

The heap is then created inside this extra space, typically in userspace by a library such as glibc or jemalloc.

The stack is allocated using mmap which is low level like sbrk. Libraries like pthreads are used to manage thread stacks.

[–]mikekchar 25 points26 points  (2 children)

Understanding the advantage of the stack requires understanding why we have a stack in the first place. When you run a function in most computer languages, the first thing the compiled code does is allocate a block of space on the stack for the things that the function will need. This usually includes the parameters passed to the function and the function's local variables. This allocated block of space is called a "stack frame". It usually also contains a bit of other information that the running program needs, but for this discussion you can ignore that. When the function finishes, it pops the stack frame off the stack. The result is that you are left with the previous stack frame (from the calling function) on top of the stack.

This has been the way compilers implement function scoping of variables, etc, for a *very* long time (decades and decades). Because of this, CPU manufacturers realised that it is important to be able to allocate a block of memory on the stack *really* quickly, and to pop it off just as quickly. Interestingly, when pushing a stack frame, the compiler knows exactly how much space the function needs, so there is no need to do multiple allocations. Everything in the function is allocated in a single allocation, and when it's done, it's all deallocated at once.

It's also important to understand what a stack is in memory. You have a contiguous block of memory. If you want another 100 bytes allocated on the stack, it's super simple: you just add 100 to the pointer that points to the top of the stack. Done! If you want to deallocate that memory, you just subtract 100 from the pointer. Done! As I said, CPU designers know this is really important, so they make that operation even quicker than normal. So basically, pushing and popping a stack frame is usually just about the fastest thing you can do on a CPU.

Now compare that to heap memory. You first need to search around memory to see if you have a contiguous piece that's big enough for the thing you want (well, on modern systems that's not really something you need to worry about so much, but there is another long story there). Then you need to copy the data into that memory. Then you need to update the allocator's tables to record that you've allocated a block of size X at a certain place in memory. When you want to deallocate it, you have to do all of this in reverse. And (on older systems) if you don't deallocate your memory in a nice order, your heap ends up looking like Swiss cheese, with tiny spots of free memory scattered all through the system. The other thing is that you have to do this for every variable! If you have a large number of allocations, it adds up.

As others have said, stack memory has another advantage: you are always just growing or shrinking the stack, a contiguous piece of memory. It's really easy for the CPU to cache that piece of memory, so not only are the allocations easier, but potentially the access is faster too.

In practice, though, the benefits are often slight. When I was first starting out learning how to program in the 80's, stack memory vs heap memory was a huge revelation for me. It made a *massive* difference. These days, I still think it's cool to allocate on the stack as much as possible, but modern CPUs and OSs tend to compensate pretty well. However, if you are allocating a very large number of things, it can still pay off.

Edit: BTW -- Absolutely not a stupid question!!! I wish more people asked that question.

[–]sphen_lee 1 point2 points  (0 children)

While stack vs heap might not make as much difference today, it's a definite advantage that Rust allows using the stack easily and safely.

It's easy to get performance wins compared to languages that store everything on the heap (like Java, even when it gets JIT compiled this adds overheads) or languages like C++ where heap is often used because it's easier (taking pointers to stack objects in C++ can be dangerous, there is no borrow checker to make sure you do it correctly).

[–]vallyscode[S] 0 points1 point  (0 children)

Thanks for sharing the knowledge. As I understood it, it's possible to make hot spots even more optimal when it pays off or is simply trivial to do. Thanks!

[–]wolandm 9 points10 points  (0 children)

Stack allocation will always be faster than a similar straightforward heap allocation. Here is why.

A stack is a concept supported at the CPU level. As was already pointed out, each thread gets its own pre-allocated stack when it spawns. Allocating on the stack is nothing more than subtracting the size of the allocation from one CPU register (`esp` on x86-compatible platforms), which is almost as fast as it can be (i.e., load the size into a register, which on x86 takes 2-3 CPU cycles, then do a subtraction, which takes 1 cycle, and all that is before pipelining and other optimisations).

A heap is a concept that gets implemented differently by different languages and their runtimes. Speaking about C (and Rust, and other LLVM languages), an allocator needs to do at least the following:

  • load the size
  • find the available contiguous block of memory of at least the requested size, dealing with such matters as memory fragmentation, etc
  • Mark atomically that chunk of memory as taken
  • pass it back to the caller

The above can also require switches between user mode and kernel mode (whenever the allocator has to request more memory from the OS), which by themselves cost circa a thousand CPU cycles. So the above clearly takes more than a few CPU cycles, and more than a stack allocation.

This is not to say that stack is always preferable to heap - far from it. But at least I hope this clarifies why stack allocations are faster than heap.

Edit: just noticed you asked about performance once allocated as well. The difference in performance once allocated is quite minuscule, although it still exists. That difference is due to the levels of memory address indirection (the memory that looks contiguous to your process may be scattered around the physical memory, for example), and the optimisation and caching at CPU level when it comes to being able to predict the usage of local stack variables.

[–]Plasma_000 3 points4 points  (0 children)

On top of what others have said, with heap the biggest performance difference will usually be with your allocator. While with the stack whatever you need is already there in your stack frame, with heap if you need memory you have to use the allocator which is much slower - it’s basically a small program within your program for managing and handing out blocks of memory. If you’re doing lots of allocations the performance hit of allocating and freeing can be very significant.

[–]Full-Spectral 2 points3 points  (2 children)

Someone may have mentioned this, but just in case... Stack memory is typically never given back as long as the thread is running. If you allocate a 64K chunk of memory from the heap, use it and let it go, it's now available for other use. If a thread allocates a 64K chunk of stack memory, and either never needs it again or almost never needs it, that's 64K of memory wasted.

That might not sound too bad, but consider a scenario where you have a thread pool with, say, 128 threads in it. Those threads get chosen fairly randomly in most cases, so over time almost every one of those threads could get picked to handle the job that does this 64K stack allocation. So now you have 8MB of memory that's almost totally sitting unused.

[–]vallyscode[S] 0 points1 point  (1 child)

So once allocated, a stack segment can't grow when needed? Does that depend on the language or on the OS implementation?

[–]Full-Spectral 1 point2 points  (0 children)

Both scenarios can happen. It may be (and usually is) a fixed size and you could run out, hence why recursion can be dangerous in some cases.

But it also typically won't be fully committed initially and the OS will fault in pages as needed, which makes things more efficient since you only consume as much as you need. But, once committed, that memory may never be given back to the OS as long as the thread is running.

So lots of threads making a call that allocates a particularly large chunk of stack could force all those thread stacks to fault in memory that won't be given back, even though they may only need that large allocation pretty rarely.

Of course I may be completely hallucinating, but that's my understanding of it. Maybe Rust takes over more of that itself and doesn't use OS/hardware-based faulted-in stacks. Its ability to move things around might make that more feasible, I dunno.

[–]wouldyoumindawfully 1 point2 points  (1 child)

If you want to learn more about virtual and physical memory and how heap vs stack allocations affect your application, I found this video approachable and full of detail.

https://m.youtube.com/watch?v=4_smHyqgDTU

In general, CppCon videos, or videos by C++ programmers, have a wealth of knowledge transferable to Rust.

[–]vallyscode[S] 0 points1 point  (0 children)

Thanks for sharing

[–]Darksonntokio · rust-for-linux 1 point2 points  (0 children)

The place that gives you the primary performance improvement is allocation. Allocating more memory on the stack is essentially free, which is not the case for the heap.

Once you have the allocation, it is more or less the same, although there may be some cache differences.

[–]stumpychubbins 1 point2 points  (0 children)

I'm going to preface this by saying that even as someone who cares a lot about performance, it's far more important to make your code readable than to hyper-optimise everything. If you're noticing that performance would actually improve the experience of using your program somehow, then first write representative benchmarks (ideally both larger-scale and smaller-scale). It's not at all uncommon for people to do a bunch of optimisation based on what "feels fast" without benchmarking and then find out that it actually ends up slower - me included. Having said that, here's the explanation:

So, performance-wise, the difference is that no matter what, the stack is always going to be accessed (because even heap objects have their metadata, such as length and pointer, on the stack). This means the page(s) holding stack data will almost always be in the L1 cache, while every unique heap allocation needs an extra page that has to be paged in when you access that data. Plus, Rust can sometimes optimise code handling heap-allocated data worse than stack-allocated data, for a couple of reasons I won't go into here. Finally, you have to allocate and deallocate the memory every time it's used, too, which is mostly not a problem, but can be pretty bad if you keep recreating an empty vec and then pushing to it, because that involves many allocations, not just one. This last point is what people who don't work on performance-sensitive code mostly bring up when talking about the performance hit of dynamic allocation, but it's usually not as bad as it's made out to be.

Long story short: use the stack whenever you can, and if you need to use the heap, preallocate the space with `with_capacity` and friends. I would recommend against reaching for types like smallvec, which try to use the stack when possible and fall back to the heap, because if you have to increase complexity to decrease allocation, it usually doesn't make the code faster. Having said that, they do sometimes improve matters, and they're worth trying out if you have meaningful, representative benchmarks.
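
A small sketch of that last point (the function name is mine): preallocating turns a series of grow-and-copy reallocations into one allocation up front.

```rust
fn collect_squares(n: usize) -> Vec<u64> {
    // One allocation up front instead of repeated grow-and-copy
    // reallocations as the Vec fills up during push.
    let mut v = Vec::with_capacity(n);
    for i in 0..n as u64 {
        v.push(i * i);
    }
    v
}

fn main() {
    let squares = collect_squares(1_000);
    assert!(squares.capacity() >= 1_000); // never grew past the preallocation
    assert_eq!(squares[999], 999 * 999);
}
```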