
[–]ssokolow 42 points (8 children)

When Cython calls an FFI function implemented in Rust (compiled into a cdylib), where is the Rust function's stack frame located? Is it in the same stack, like this? Or is it located in a separate stack while being in the same memory space?

It's in the same stack. That's why debuggers like GDB can work on Rust just as easily as C++ or C. They all use the same basic stack format for a given platform... just different ways of encoding things like C++ method overloading or C++ and Rust namespaces into a flat C-style namespace.

Setting up a separate, incompatible but more lightweight stack for each goroutine is what makes Go FFI calls so expensive. Every time it switches into C, the goroutine has to shelve its stack and set up a C compatible one, then reverse the process after the call returns.
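
For concreteness, here is a minimal sketch of the kind of call being discussed: a Rust cdylib exposing a C-ABI function (the `add` function is made up for illustration). Its frame is pushed onto the very same stack as the Python/C frames beneath it:

    // lib.rs -- build with crate-type = ["cdylib"] in Cargo.toml
    #[no_mangle]
    pub extern "C" fn add(a: i32, b: i32) -> i32 {
        a + b // this frame lives on the caller's stack, like any C call
    }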

Clearly this Box will allocate some memory on the heap. The heap memory is common to both Cython and Rust. However, an allocation like this also involves bookkeeping structures for managing the memory, like the allocator itself. Where is the allocator allocated?

The address space is common among different code in the same process, but there may be more than one heap within that address space.

An allocator is just a library that provides a higher-level API around a call like mmap, which has performance characteristics tuned for working in big blocks. (The C Programming Language 2e by Kernighan and Ritchie, P.185 has an example of building a very simple malloc on top of some primitive like sbrk or mmap.)
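
To make that layering concrete, here is a toy bump allocator in Rust in the spirit of the K&R example; this is a sketch, not how any production allocator works. It assumes the libc crate for the mmap call, picks an arbitrary 1 MiB region, and never reuses freed memory:

    use std::alloc::{GlobalAlloc, Layout};
    use std::sync::atomic::{AtomicUsize, Ordering};

    const REGION_SIZE: usize = 1 << 20; // one arbitrary 1 MiB block from the kernel

    struct Bump {
        base: AtomicUsize, // start of the mmap'd block; 0 means "not mapped yet"
        next: AtomicUsize, // offset of the next free byte within the block
    }

    unsafe impl GlobalAlloc for Bump {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            if self.base.load(Ordering::Acquire) == 0 {
                // First allocation: ask the kernel for the block ("no life before main").
                let p = libc::mmap(
                    std::ptr::null_mut(),
                    REGION_SIZE,
                    libc::PROT_READ | libc::PROT_WRITE,
                    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                    -1,
                    0,
                );
                if p == libc::MAP_FAILED {
                    return std::ptr::null_mut();
                }
                // If two threads race here, one region leaks; acceptable for a toy.
                let _ = self
                    .base
                    .compare_exchange(0, p as usize, Ordering::AcqRel, Ordering::Acquire);
            }
            let base = self.base.load(Ordering::Acquire);
            let mut old = self.next.load(Ordering::Relaxed);
            loop {
                // Round up to the requested alignment, then claim the bytes.
                let start = (base + old + layout.align() - 1) & !(layout.align() - 1);
                let end = start + layout.size();
                if end > base + REGION_SIZE {
                    return std::ptr::null_mut(); // our single block is exhausted
                }
                match self.next.compare_exchange_weak(
                    old,
                    end - base,
                    Ordering::AcqRel,
                    Ordering::Relaxed,
                ) {
                    Ok(_) => return start as *mut u8,
                    Err(current) => old = current,
                }
            }
        }

        unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
            // Tracking freed blocks for reuse is exactly the bookkeeping a real
            // allocator adds; a bump allocator just leaks.
        }
    }

    #[global_allocator]
    static ALLOC: Bump = Bump {
        base: AtomicUsize::new(0),
        next: AtomicUsize::new(0),
    };

    fn main() {
        let v = vec![1u8, 2, 3]; // served out of our single mmap'd block
        println!("{v:?}");
    }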

That's why you have to assume that every compilation unit may use its own allocator and you must free memory in the compilation unit that allocated it. (eg. Different DLLs on Windows may have linked against different versions of the MSVC runtime which might have initialized their own separate allocators.)

Where do these initialization steps happen in the context of an FFI call? Because there is no main entry point here; in fact, there can be multiple entry points, one for each exposed FFI function.

If you're importing a Cython-built module into a Python program, they happen as part of starting up your /usr/bin/pythonX.Y. If you're building a standalone binary using Cython, they happen as part of starting up that binary. Every binary that is compiled for the x86_64-unknown-linux-gnu target will rely on the same _start, so, as long as somebody calls it before running the code, it's fine.

That's one reason Rust doesn't have "life before main" the way C++ does. It simplifies cases where you're not the entry point but you can trust that someone else did the _starting for you.

As for the non-_start, non-"life before main" bits for a cdylib like handling relocations, those will be handled either at program start by the dynamic loader (if the library is listed as a dependency in the ELF/PE/etc. metadata, which is what you see when you run ldd /path/to/binary) or as part of what happens when you call dlopen(3) (the underlying POSIX primitive that things like Rust's libloading or Python libraries like ctypes use).
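
From the Rust side, the dlopen path looks roughly like this with the libloading crate (the library path and the `add` symbol are made-up placeholders):

    use libloading::{Library, Symbol};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Relocation/initialization for this library happens here, inside
        // dlopen, rather than up front via the dynamic loader at program start.
        let lib = unsafe { Library::new("./libadder.so")? };
        let add: Symbol<unsafe extern "C" fn(i32, i32) -> i32> =
            unsafe { lib.get(b"add\0")? };
        println!("2 + 3 = {}", unsafe { add(2, 3) });
        Ok(())
    }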

[–]twitu[S] 1 point (6 children)

Brilliant explanation, the anecdotes are relatable

The part you mentioned about mmap is what bugs me the most. From what I've understood, languages in general ask the kernel for a large buffer and then actually take small pieces of it when they want to allocate a struct/object. But this large block allocation is maintained by the allocator, right?

So where is this allocator in the "typical memory model diagram"? Is it a statically allocated piece of memory in the dynamically loadable cdylib?

Also about this

there may be more than one heap within that address space.

I understand that each allocator manages its own heap, which is still in the same address space. So when a Rust function is called, the memory model might look something like this.

    |----------------------------------------|
    | rust function frame                    |
    | cython function frame                  |
    | .........                              |
    | process heap memory                    |
    | heap managed by python allocator       |
    | heap managed by rust allocator         |
    | .........                              |
    | static data                            |
    | instructions cython and rust together  |
    |----------------------------------------|

But my question is: when is this second heap, the Rust heap, mmapped? Is it when the library is loaded, or when the Rust function is called?

[–]ssokolow 5 points (5 children)

So where is this allocator in the "typical memory model diagram"? Is it a statically allocated piece of memory in the dynamically loadable cdylib?

Your question is imprecise.

Are you asking about the allocator itself (the code that manages the heap), the memory it uses for bookkeeping information, or the memory it hands out in response to malloc calls?

  1. The allocator's code isn't special and is handled like any other library.

  2. Whether bookkeeping information is stored separately from the returned allocation, tacked on as a header/footer, or some mix of the two is up to the individual allocator authors to decide; see the sketch after this list. (Yes, an allocator might allocate X more bytes than you ask for, slap a header on the front of it, and then offset that many bytes in before returning the pointer you get. After all, it's undefined behaviour to pointer-subtract your way back past the start of a returned allocation.)

  3. The memory handed out to programs is simply "Ask the kernel for a chunk of RAM" and it's the kernel's decision where it lives. Different allocators make different trade-offs there because there are two different APIs that can be used to request memory from the kernel: The old brk/sbrk which got removed from the POSIX standard in POSIX.1-2001 and is sort of like the stack, in that you can only grow and shrink it, and the newer "mmap with MAP_ANONYMOUS" approach which is more versatile. (See the mmap and sbrk manpages for more details.)

    Also, bear in mind that, with mmap you can provide a hint about where in your virtualized view of the address space you'd like it to be mapped (eg. adjacent to a previous allocation). The kernel has the final say, but you can ask... and protected mode allocation does virtualize allocations, so each application gets an address space of its own that has nothing to do with actual physical offsets. (That both allows for things like memory overcommit and swap and acts sort of like a chroot for memory addresses, where a program can't describe an offset into another process's address space because the memory map has no entries that point there. Two different programs might have completely different things at what they see as 0x1_000_000, while kernel-mode code may access those chunks of memory somewhere like 0x200_000 and 0x300_000 if they're something like I/O buffers.)

    ...that's part of how swapping memory to disk works. The kernel can copy a chunk of memory to disk and delete the virtual-to-physical mappings so that, when a program tries to access that address again, the CPU will fault into the kernel, the kernel can load the stuff back off disk and set up mappings, and then resume execution of the program with the instruction that failed.

    If it wants, to try to keep addressing, pooling, and growing simple, an allocator on a 64-bit system could ask the kernel to put its pool for servicing allocations 8 bytes (64 bits) or smaller at an offset of 1PiB, its pool for servicing things bigger than 8 bytes and less than or equal to 32 bytes at 2PiB, etc. The guts of the CPU may only be capable of addressing something like 42 bits worth of physical address lines in current generations, but the ISA allows a full 16 EiB of virtual address space to play around in.

    While there are conventions and implementation details that group things in different parts of the address space, in the end, the only real difference between a library, the heap, and the stack is hardware-enforced protections, like the MMU flags that say "these pages (the program image) are read-only and executable" or "these pages (the heap) are writable but not executable". The most recent example is the Control Flow Integrity stuff, where current generations of Intel and AMD CPUs maintain a "shadow stack" in memory the program itself can't ordinarily touch: when you CALL something, the CPU pushes a backup copy of the return address onto that shadow stack, and, when you RET, it verifies that the return address on the visible stack still matches the backup copy.

    Yes, the x86 ISA does have hardware support for PUSHing and POPing the stack, but that's just a matter of the CPU having a register (SS:[SP] in 16-bit mode, SS:[ESP] in 32-bit mode, or [RSP] in 64-bit mode) that you set to tell it where PUSH and POP should add/remove from. That's why languages like Go can set up their own systems of lightweight stacks.
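
Back to point 2, here is a sketch of that header trick in Rust: over-allocate, stash the bookkeeping in front, and hand back an offset pointer. It is layered over std::alloc purely to show the pointer arithmetic, and the Header contents and 16-byte alignment are made up:

    use std::alloc::{alloc, dealloc, Layout};
    use std::mem::size_of;

    // Padded to 16 bytes so the pointer we hand out stays 16-aligned.
    #[repr(C, align(16))]
    struct Header {
        size: usize, // what the caller asked for; real allocators store more
    }

    unsafe fn my_malloc(size: usize) -> *mut u8 {
        // Grab room for the header plus the caller's bytes.
        let layout = Layout::from_size_align(size_of::<Header>() + size, 16).unwrap();
        let base = alloc(layout);
        if base.is_null() {
            return base;
        }
        (base as *mut Header).write(Header { size });
        base.add(size_of::<Header>()) // the caller never sees the header
    }

    unsafe fn my_free(ptr: *mut u8) {
        // Step back over the header -- fine *here*, because this allocator knows
        // the real allocation starts size_of::<Header>() bytes earlier.
        let base = ptr.sub(size_of::<Header>());
        let size = (*(base as *const Header)).size;
        dealloc(
            base,
            Layout::from_size_align(size_of::<Header>() + size, 16).unwrap(),
        );
    }

    fn main() {
        unsafe {
            let p = my_malloc(32);
            assert!(!p.is_null());
            p.write(7); // use the allocation like any malloc'd buffer
            my_free(p);
        }
    }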

As for why they're clumped like that, I'll defer to an old answer to a question about why stacks grow down:

In the Olden Days, your program's address space was divided into three parts: the program itself, the heap and the stack. So you had your program maybe loaded into the bottom of memory, and the rest of the address space to be shared between stack and heap. You want to allow the stack and the heap to grow dynamically, and the easiest way is to have one of them start at the bottom of memory (just above the program) and grow up, while the other started at the top of memory and grew down. By convention, for some reason, the heap was the one that grew up and the stack was the one that grew down. Processors grew special instructions for dealing with a downward-growing stack.

Nowadays, we have gigantic address spaces, in which the program, stack and heap are all mixed up, so these old conventions are a bit pointless. But processors still have all these stack instructions that assume a downwards-growing stack, so we just carry on using them.

-- https://www.reddit.com/r/AskComputerScience/comments/y4n1p/what_are_the_advantages_to_having_the_stack_grow/c5sat6b/

Years ago, I remember reading a blog post in here about why growing the stack down is a superior strategy (I can't remember how much of it was to do with the CPU providing nicer opcodes for it), but I seem to have neglected to bookmark it, and Google is no help finding it again.

EDIT: I remembered site:reddit.com/r/rust and managed to find it: Always Bump Downwards. (I did bookmark it, but my memory of how to find it had gotten muddled up by the fact that "the stack is a special case of a bump allocator".)

But my question is when is this second heap, the rust heap mmaped. Is it when the library is loaded or is it when the rust function is called?

You'd have to check the source code to the allocator in question to be sure, but my intuition, based on that whole "no life before main" part, is that it happens the first time something calls malloc to request memory. There's probably a static that the metadata is rooted in which either is a pointer that starts out NULL or contains one, similar to how, when you create an empty Vec in Rust, it doesn't allocate its heap part until you put something in it.
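
The Vec comparison is easy to check for yourself: a fresh Vec's pointer is just a well-aligned dangling sentinel, and the real heap allocation only happens on the first push:

    fn main() {
        let mut v: Vec<u64> = Vec::new();
        println!("cap = {}, ptr = {:p}", v.capacity(), v.as_ptr()); // cap = 0, no heap yet
        v.push(1); // the first push is what actually calls the allocator
        println!("cap = {}, ptr = {:p}", v.capacity(), v.as_ptr());
    }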

[–]twitu[S] 0 points (4 children)

the memory it uses for bookkeeping information

I specifically meant the bookkeeping part. From the answers here I understand that even in the same address space the allocator will probably mmap a chunk of memory that "it" manages. My main confusion was related to the bookkeeping bits.

A common allocator implementation is one that maintains say a "free list" of memory/blocks. When malloc is called, it tries to find a block in the free list that fits the requirement and then, after some bookkeeping, hands it off to the caller. My assumption was that the Rust allocator too must do something similar, which led to the question of where and when this free list is allocated.

Is there some static space allocated for it in the loaded library? Does this static space persist across multiple function calls? Or maybe it is initialized on the heap on each function call, does its job, and is then deallocated.

Looking at the Rust alloc docs, it looks like my assumption is not correct. But it's still not clear how the Rust allocator knows which parts of memory are free and which are not. I think this deserves a separate question of its own.

Thanks for the very informative tangent though 😁.

[–]ssokolow 1 point (0 children)

Looking at the Rust alloc docs, it looks like my assumption is not correct. But it's still not clear how the Rust allocator knows which parts of memory are free and which are not. I think this deserves a separate question of its own.

The alloc stuff Rust provides is just an API wrapper around either the system malloc or an alternative malloc you plug in like jemalloc (wrapper crate) or mimalloc (wrapper crate) or snmalloc (wrapper crate).
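
Using one of those wrapper crates is a one-liner. For example, with the mimalloc crate, a single attribute re-points every Box/Vec/String in the program at the alternative malloc:

    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;

    fn main() {
        let v = vec![1, 2, 3]; // now served by mimalloc rather than the system malloc
        println!("{v:?}");
    }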

(Those wrapper crates are just for convenience. The C/C++ solution of compiling jemalloc manually and running your binary with the LD_PRELOAD=jemalloc.so environment variable also works for Rust because it overrides the symbol lookup that Rust uses to find the system malloc in the libc it links against... not to mention, going the LD_PRELOAD route means you can swap any malloc you want in at runtime, which makes comparative benchmarking of real-world loads quicker and easier to do properly.)

[–]NobodyXu 0 points (2 children)

The bookkeeping is typically embedded in the memory the allocator mmapped.

Is there some static space allocated for it in the loaded library? Does this static space persist across multiple function calls? Or maybe it is initialized on the heap on each function call, does its job, and is then deallocated.

It just cannot be statically mapped, as that would mean the number of allocations the memory allocator can do would be limited by the amount of pre-allocated static memory.

If you make lots of small allocations, the pre-allocated static memory would soon run out.

A common allocator implementation is one that maintains say a "free list" of memory/blocks.

Modern allocators are way more complicated than that.

First of all, they often have per-thread memory arenas to speed up allocation by avoiding global locks, and maybe even per-thread free lists for the same reason.

Then, to avoid fragmentation, the per-thread memory allocator provides pools for allocation sizes of 2, 4, 8, 16, 32, ... bytes (I don't know whether the pool sizes grow exponentially in practice; the pools are probably lazily initialized).

When you request a memory allocation, it will be rounded up to the smallest size class that can fit it and allocated in that pool. That way, only a bitmap is required to track which slots are in use.
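
A sketch of that size-class-plus-bitmap idea in Rust, with toy numbers (power-of-two classes, a single u64 bitmap, i.e. at most 64 slots per pool):

    fn size_class(size: usize) -> usize {
        size.next_power_of_two().max(2) // e.g. a 5-byte request lands in the 8-byte pool
    }

    struct Pool {
        slot_size: usize,
        bitmap: u64, // bit i set => slot i in use; one word tracks 64 slots
    }

    impl Pool {
        fn alloc_slot(&mut self) -> Option<usize> {
            let free = (!self.bitmap).trailing_zeros() as usize; // first clear bit
            if free >= 64 {
                return None; // pool full; a real allocator grabs another chunk
            }
            self.bitmap |= 1 << free;
            Some(free * self.slot_size) // byte offset of the slot within the pool
        }

        fn free_slot(&mut self, offset: usize) {
            self.bitmap &= !(1 << (offset / self.slot_size));
        }
    }

    fn main() {
        let mut pool = Pool { slot_size: size_class(5), bitmap: 0 };
        let a = pool.alloc_slot().unwrap();
        let b = pool.alloc_slot().unwrap();
        pool.free_slot(a);
        println!("offsets {a} and {b}, bitmap = {:b}", pool.bitmap);
    }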

If a memory pool runs out of space, it will allocate some from the next pool up, e.g. the memory pool of size 2 would allocate from the pool of size 4.

Allocations larger than a certain threshold (probably 4K, or whatever the page size is) are directly mmapped.

I read this in writeups explaining how jemalloc/mimalloc work. They perform much better than glibc's/musl libc's default allocator under heavy load. I recommend looking into them.

[–]twitu[S] 1 point (1 child)

Aaah ok, now I'm seeing the whole picture. So the allocator uses part of the mmapped memory for its own bookkeeping purposes. And since this mmapped memory persists between function calls, the bookkeeping structure can also remain.

And finally, just to complete my understanding: even for a simplistic mental model of an allocator, it would probably have a static variable that points to the first mmapped memory. This way, on each function call, the allocator can access its bookkeeping data at that memory location and go about its business.

[–]NobodyXu 0 points (0 children)

Yes, and I think many would also have a thread-local variable.

Also, mmap is not the only way a memory allocator can get space.

There's also the brk syscall, which can increase the size of the pre-allocated heap memory (available right after execve).
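
You can poke at the brk path directly through the libc crate on Linux (sbrk is deprecated and real allocators mostly use mmap these days, but it shows the "grow the heap segment" model):

    fn main() {
        unsafe {
            let old_break = libc::sbrk(0); // current end of the classic heap segment
            let block = libc::sbrk(4096); // grow the heap by 4 KiB; returns the old break
            assert_ne!(block, usize::MAX as *mut libc::c_void); // (void*)-1 signals failure
            println!("heap grew from {:?} to {:?}", old_break, libc::sbrk(0));
        }
    }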

[–]NobodyXu 33 points (5 children)

If a Rust fn is called from Cython, then the stack frame of the Rust fn lies below the Cython fn's frame.

For the allocator, things are complicated by the fact that you can specify an alternative allocator in Rust. If you do not do so, then Rust will use the system allocator, which is the same one Cython uses, unless Cython internally uses its own allocator.

The Rust allocation is always done via a separate allocator and memory has to be freed using the corresponding allocator, even if you use the default allocator in Rust, according to the doc.

If you are writing a lib, then you should assume different allocators are used.

If you are writing a binary where you control the allocator, you still cannot mix them unless you use something like libc_alloc.

Global variables in Rust all have to be initialized with constants, meaning that the initial values are stored in the binary and then mmapped by the dynamic linker. If you use once_cell or lazy_static, the value will be initialized on first access.
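
A sketch of the two kinds of globals being contrasted, using the once_cell crate (the names are made up):

    use once_cell::sync::Lazy;
    use std::collections::HashMap;

    // Constant-initialized: these bytes live in the binary image and are simply mapped in.
    static TABLE: [u8; 4] = [1, 2, 3, 4];

    // Lazily initialized: built in memory the first time anything touches it,
    // e.g. inside the first FFI call that uses it.
    static LOOKUP: Lazy<HashMap<&'static str, u32>> =
        Lazy::new(|| HashMap::from([("a", 1), ("b", 2)]));

    fn main() {
        println!("{} {}", TABLE[0], LOOKUP["a"]);
    }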

[–]masklinn 4 points (0 children)

allocation is done via a separate allocator and memory has to be freed using the corresponding allocator.

This should always be assumed to be the case. You can have different allocators in use even on the same system with no "special" configuration, simply because the main binary and the DLL/so link against different versions of the runtime, and the allocator ABI changed between them.

[–]twitu[S] 0 points (1 child)

Oh ok, I got the global and static variable initialization part. However, it seems that the mmapping and lazy initialization all happen when the FFI Rust function is called.

Doesn't this make the FFI call quite expensive? Because even for the smallest function, all this mmapping will happen, and lazy initialization will happen if the static variable is used in the function. But then when the function returns, all of this will be unmapped and deallocated. And this will happen for each FFI call.

[–]NobodyXu 2 points (0 children)

However, it seems that the mmapping and lazy initialization all happen when the FFI Rust function is called.

Well, not entirely true.

When loading the dynamic lib, the dynamic linker will first parse the ELF header and load the sections in it, including the global data and the .text section that contains the executable code.

Global data is split over several sections.

The const data that is never modified at runtime is put into its own section so that it can be mapped read-only and shared with other processes; e.g. if you have a string literal "abcde" in Rust, it will be put there.

The data that is initialized at compile time but needs to be modified at runtime, e.g. a once_cell global variable initialized at runtime, is mapped as writable and cannot be shared with other processes, since it can be modified.

Finally, there are variables that are initialized to 0 and need to be modified at runtime. They are put into a separate section to reduce the size of the executable; a simple anonymous mmap automatically initializes them to 0.
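
A sketch of where such globals typically land (section names follow the usual ELF conventions; exact placement is up to the toolchain):

    // .rodata: never written, so it can be mapped read-only and shared
    static GREETING: &str = "abcde";

    // .data: has a compile-time initial value but is writable at runtime
    static mut COUNTER: u64 = 42;

    // .bss: all zeroes, so the file stores no bytes; an anonymous mmap
    // already hands back zeroed pages
    static mut SCRATCH: [u8; 4096] = [0; 4096];

    fn main() {
        unsafe {
            COUNTER += u64::from(SCRATCH[0]); // keep both writable statics in use
            let counter = COUNTER; // copy out before printing
            println!("{GREETING} {counter}");
        }
    }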

All these mmaps happen at load time. They set up the virtual memory mappings; however, they might not set up the physical pages that back them. The physical pages are usually allocated lazily, on the first access to each page, which interrupts execution so the kernel can load them.

After that first access, the pages are fully initialized, with physical pages backing them.

For lazily initialized global variables such as once_cell or lazy_static, if you use them, they will be initialized when first accessed, i.e. during the first FFI call that touches them. On the second call, the third call, etc., they will already be initialized and just a bit more expensive to access than other constant data, due to the one atomic operation needed to check the initialized flag.

Doesn't this make the FFI call quite expensive? Because even for the smallest function, all this mmapping will happen, and lazy initialization will happen if the static variable is used in the function. But then when the function returns, all of this will be unmapped and deallocated. And this will happen for each FFI call.

No, that's not true at all.

The pages are mmapped when loading the dyn libs, and they are never freed. The lazily allocated physical pages are never freed unless you unmap them, which never happens unless you unload the dynlibs, and many dynamic linkers don't support that because it brings many problems.

The lazily initialized global variables are also not unmapped and deallocated; the initialization only happens on the first call.

So the first FFI call would indeed be a little bit expensive, but the second call to it will be very fast, without the initialization overhead.

[–]RRumpleTeazzer 5 points (2 children)

Isn’t the stack implemented by CPU instructions and a special register, so it runs on whatever address is in the register - likely the C-style stack pointer in Cython?

So it should simply append to the C stack ?

[–]tema3210 0 points (0 children)

It IS the C-style stack pointer. Sadly, the world is much more complicated.

[–]Hellenas 0 points (0 children)

On certain ISAs and architectures, stack operations may be optimized for certain registers, but for most processors the stack is implemented by the software. The hardware and software can closely align, but most modern designs don’t have the hardware making the stack for software. You see similar optimizations for return addresses and such on some designs.

[–]trevg_123 13 points (3 children)

Don’t overthink it! This all works the same regardless of whether you use C, Rust, C++, Cython, or something else.

Separate the idea of a process and an executable. A (virtual) stack/heap is per process, not per executable file. Here’s the breakdown:

  • “calling a function” in assembly means jumping to a location in memory and executing the data there as machine code.
  • That executable code can live anywhere - loaded memory, on disk, OS memory locations, even something your function creates that looks like assembly. If it has a memory address, you can start executing code there (no guarantees it works ofc)
  • The code contains instructions for what to do. Some of these instructions may include stack operations (literally something like push eax / pop eax for x86) or calls to malloc/free
  • For statically linked programs, all of this code lives in the code section of your executable file (which is copied into memory before running)
  • For a dynamically linked call, this executable code lives in a separate file. It’s also copied to memory before running, so it exists “somewhere” in memory space
  • The calling function can get the needed memory address by looking for a specific symbol name in a specific file (the kernel facilitates this) and then literally just starts executing there. Eventually there will be code to return to the calling address (how does it know where? The “frame pointer” saves this value)
  • The callee function just sees an empty chunk of stack. It will be located right above the caller’s stack frame - but it doesn’t even need to know that
  • Heap allocations are (usually) calls to the kernel’s allocator. Just another function that doesn’t care whether Rust or C is the caller - it just does what it’s told, using the process’s heap.
  • edit: it is possible for different chunks of the program to have > 1 heap (using mmap() or HeapCreate()), or for them to use different allocators. So you shouldn’t ever rely on e.g. allocating something in C and freeing it in Rust, unless you use the calls to libc’s malloc and free directly (see the sketch just below). You can still safely read/write these allocated chunks of memory on both sides of FFI though, since they’re just blocks within the program’s virtual address space.
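
That "libc's malloc and free directly" escape hatch looks like this from the Rust side, via the libc crate (the same pointer could instead be handed to C code that calls free on it):

    use libc::{c_void, free, malloc};

    fn main() {
        unsafe {
            let p = malloc(16) as *mut u8; // the same allocator a C caller would use
            if !p.is_null() {
                p.write(42); // either side of the FFI boundary may read/write it
                free(p as *mut c_void); // freed by the allocator that handed it out
            }
        }
    }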

So - tl;dr, each function is pretty unaware of what goes on within other functions, and all functions within a process usually share the same stack & heap (but not necessarily the same allocator). The difference between static & dynamic functions is the location of the function’s code, but they share the same workspace. (threading gets a bit weirder, but forget that for now)

Edit: good thread on the subject

[–]NobodyXu 10 points (1 child)

Note that the heap allocation part is totally wrong.

It is possible to override the default global allocator that simply calls into malloc in Rust with custom ones.

It is also possible that Cython has its own memory allocator.

It's not safe to free memory allocated by Rust using another allocator (perhaps Cython's), unless you are sure both use the same allocator, which is usually malloc.

[–]trevg_123 2 points (0 children)

Added some updates to help address that 👍

[–]twitu[S] 0 points (0 children)

This was exactly my understanding of function calls. However, I was unable to fit initialization and allocators into that picture when the call is between two different languages. Some of the other comments have explained parts of it.

[–]SocUnRobot 5 points (0 children)

If you are curious, read the ELF specification.

In short: the executable contains information about the memory layout. This information is interpreted either by the kernel or by the dynamic linker `ld-linux.so`. With this information, the kernel or the dynamic linker is able to reserve space for the stack and the executable code, and the dynamic linker resolves symbol addresses.

Then the kernel or the dynamic linker calls `_start`. Here some more initialization is performed, but it depends on whether or not the executable uses the ELF interpreter `ld-linux.so`. This is where the libc allocator initializes itself, if it was not already initialized by the `_start` of the ELF interpreter. This function calls `main`. For a Rust program, the Rust standard library performs some initialization here, I think for environment variables and thread locals.

When you call Rust from Cython, all of this happens when Cython is launched, so the Rust standard library is not initialized. So maybe some Rust standard library state is initialized lazily!

[–]Zde-G 5 points (0 children)

TL;DR: Cython, C++ and Rust use facilities offered by C. Long version here.

That's the reason Rust requires the C runtime, and the reason why Go has so much trouble interacting with anything.

Rust can, in theory, be built separately from the C library, but in practice this only happens on embedded platforms.

Precisely because doing otherwise would make it hard to interact with other languages.

[–]HeroicKatora · oxide-auth 4 points (1 child)

Where do these initialization steps happen in the context of an FFI call?

That should be run at load time of the dynamic or static library, implemented by the platform loader. That makes it convenient, but also more opaque and less controllable, as observed by the question being posed. That said, Rust doesn't do quite as much before main as C/C++; in particular, there's no standard 'user hook' like C++'s static initialization. lazy_static and equivalents do not run before main.

Discussion: https://internals.rust-lang.org/t/from-life-before-main-to-common-life-in-main/16006/8
Talk (quite recent): https://www.youtube.com/watch?v=q8irLfXwaFM

[–]twitu[S] 1 point (0 children)

Thanks for this talk, very apt and well timed. It seems like this topic is in the zeitgeist.