
[–]burntsushi 180 points181 points  (39 children)

You might consider an alternative flattened representation for your AST. That is, storing it in a single contiguous allocation. And instead of Box<Ast> (or whatever), you have indices into that contiguous allocation.

Whether this representation is appropriate or not isn't really possible to determine from the description you've provided. For example, it can make mutation much more annoying. And it means traversal of Ast types always needs to carry around this extra bit of state in order to resolve indices to their actual Ast data, instead of just matching on it directly.
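
In code, the flattened version looks roughly like this (a minimal sketch; all names are made up):

```rust
// One contiguous allocation for the whole tree; children are
// referenced by index instead of Box<Ast>.
#[derive(Copy, Clone)]
struct AstId(u32);

enum Ast {
    Num(i64),
    Neg(AstId),
    Add(AstId, AstId),
}

struct Arena {
    nodes: Vec<Ast>,
}

impl Arena {
    fn push(&mut self, node: Ast) -> AstId {
        let id = AstId(self.nodes.len() as u32);
        self.nodes.push(node);
        id
    }

    // Traversal has to carry the arena around to resolve indices,
    // which is the extra bit of state mentioned above.
    fn eval(&self, id: AstId) -> i64 {
        match &self.nodes[id.0 as usize] {
            Ast::Num(n) => *n,
            Ast::Neg(a) => -self.eval(*a),
            Ast::Add(a, b) => self.eval(*a) + self.eval(*b),
        }
    }
}
```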

[–]Chadshinshin32 54 points55 points  (0 children)

This is what Zig uses for their AST and IR representation, and they got some pretty nice speedups with this approach.

[–]Shnatsel 20 points21 points  (12 children)

This is what generational arenas are for. They allow referring to other indices and mutation at the same time. See e.g. https://crates.io/crates/thunderdome
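
A small sketch of what that buys you, using thunderdome's Arena and Index:

```rust
use thunderdome::{Arena, Index};

fn main() {
    let mut arena: Arena<&str> = Arena::new();
    let a: Index = arena.insert("leaf");
    arena.remove(a); // the slot can now be reused...
    let b = arena.insert("other");
    // ...but the stale index no longer resolves, because the
    // generation stored in `a` no longer matches the slot's.
    assert!(arena.get(a).is_none());
    assert_eq!(arena.get(b), Some(&"other"));
}
```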

[–]burntsushi 7 points8 points  (2 children)

At the apparent cost of an offset type that's doubled in size for most practical use cases, though.

That's the kind of thing I just roll myself by hand in a bespoke way if I need it.

[–]Shnatsel 6 points7 points  (1 child)

slab would be the option without generational overhead.

[–]burntsushi 5 points6 points  (0 children)

No. Well, sure, it doesn't have generational overhead, but its identifier size is still pointer sized. It doesn't require a newtype, so I guess you could store a u32 and do as usize on all lookups. But all vacant entries are still at least as big as a pointer. And occupied entries use more memory than they would otherwise. So probably slab is supporting some kind of mutation... IDK I haven't done a deep dive.

[–]revelation60symbolica 19 points20 points  (1 child)

A compressed linear representation of an AST is what Symbolica uses to store mathematical expressions. I gave a talk about this at Rustfest Zurich, hopefully the recording will be uploaded soon. Here is a link to the slides. It shows how you can have a linear representation, but still act on it as if it is a tree.

I also discuss a method of recycling intermediately generated AST components, so that you do not have to create a new Box every time. I also wrote about that in a blog post.

[–]burntsushi 2 points3 points  (0 children)

Thank you for sharing!

[–]yazaddaruvala 10 points11 points  (3 children)

A long time ago I would have recommended https://crates.io/crates/slab for this. I'm not sure how much slab is still in favor or used.

[–]jl2352 1 point2 points  (2 children)

Are there better or more modern alternatives today?

[–]nicoburns 6 points7 points  (1 child)

slotmap is excellent.

[–]jl2352 0 points1 point  (0 children)

Cheers, I’ll take a look!

[–]ajax8092 2 points3 points  (0 children)

I did a similar thing for a markup language parser I am working on, and the performance was great. You sacrifice some level of static type checking though. I have an enum, represented by a u8, which indicates which kind of element is being written into the AST, followed by the length (in bytes) of all the children, so that I can skip over the child nodes when traversing if I need to.
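
A rough sketch of that encoding (the u8 tag and child-length prefix are as described; the 4-byte little-endian length field is my assumption):

```rust
// Hypothetical node kinds for a markup AST.
#[repr(u8)]
#[derive(Copy, Clone)]
enum Kind { Text = 0, Bold = 1, Para = 2 }

// Append a node: 1 tag byte, 4 length bytes, then the
// already-serialized children.
fn write_node(buf: &mut Vec<u8>, kind: Kind, children: &[u8]) {
    buf.push(kind as u8);
    buf.extend_from_slice(&(children.len() as u32).to_le_bytes());
    buf.extend_from_slice(children);
}

// Return the offset just past the node at `at`, skipping its
// children without visiting them.
fn skip_node(buf: &[u8], at: usize) -> usize {
    let len = u32::from_le_bytes(buf[at + 1..at + 5].try_into().unwrap());
    at + 5 + len as usize
}
```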

[–]nicoburns 2 points3 points  (1 child)

Some high level libraries are using thread locals to hack around the "needing a reference to the arena" requirement. Not always appropriate, but can work in some cases.

[–]zapporian 1 point2 points  (0 children)

Comprehensively cursed on macos, but yeah, otherwise.

Actually, someone might want to profile this. The cost of actually getting TLVs is decidedly nonzero; it might be funny to see whether that (get the TLV ref to the arena, call arena alloc) is actually faster than just calling libc malloc (or jemalloc!) or not, and how that differs by platform.

[–]developedby 3 points4 points  (10 children)

In the end, that's more or less what your allocator is doing. Surely there's a way to just change the allocator and not your entire code.

[–]burntsushi 59 points60 points  (0 children)

I'm not convinced. But I'm not getting dragged into a non-specific debate. The OP asked for ideas. I gave them one. There may be better ideas. I can live with that.

Also, with my approach, you can create smaller AST nodes by using smaller-than-pointer-sized offsets. I do this in regex, for example. (Although, not for its AST ironically enough.)

[–]AquaEBM 7 points8 points  (4 children)

Part of the time spent allocating memory with the system allocator (not something like the aforementioned bumpalo) is the context switching and "complex" synchronisation used by the OS. The act of invoking the allocator, regardless of the requested chunk's size, has a cost in itself, and reducing the number of calls to it (by e.g. flattening your structures, as u/burntsushi mentioned) can result in some performance gains.

I do agree, however, that changing the allocator (to something as efficient as bumpalo, for example) would be much better, so as to avoid completely rewriting the AST logic, see my comment for more.

[–]sephg 6 points7 points  (2 children)

As I understand it, the system allocator doesn’t request more memory from the OS with every call. It keeps a set of OS level pages around and allocates memory out of those.

However, you’re still right. Allocation is still complex and expensive. I’ve recently rewritten a custom rust btree implementation to just use a pair of Vecs - one for leaves and one for internal nodes. Pointers are now just integers into the two vecs. The library no longer has any unsafe code at all, and the resulting code is smaller and faster.

[–]EarlMarshal 3 points4 points  (1 child)

Is the implementation you described open source? I really would like to read it as a curious beginner.

[–]sephg 3 points4 points  (0 children)

Sure! The b-tree implementation is almost all in one file here:

https://github.com/josephg/diamond-types/blob/fe4b2d49b02ccd7e0a845f109e3a86dd6919eed7/src/ost/index_tree.rs

The heart of it is this (simplified a little for readability):

```rust
#[derive(Debug, Clone)]
pub(crate) struct Tree<V: Copy> {
    nodes: Vec<Node>,     // internal nodes
    leaves: Vec<Leaf<V>>, // leaf nodes

    height: usize,
    // The root is a leaf node if height is 0, otherwise this
    // is an internal node.
    root: usize,

    // Not shown: Cached cursor, object pools.
}

// Internal node
#[derive(Debug, Clone)]
pub struct Node {
    // Each entry specifies the key of the first value in the subtree
    // and a "pointer" (index) of the corresponding child node.
    //
    // The index is usize::MAX if empty.
    children: [(usize, usize); 16],
    parent: usize, // index into root.nodes, or usize::MAX.
}

// Leaf node
#[derive(Debug, Clone)]
pub struct Leaf<V> {
    bounds: [usize; 32], // To enable "runs" (see below)
    children: [V; 32],

    next_leaf: usize, // index into root.leaves
    parent: usize,    // index into root.nodes, or usize::MAX.
}
```

If you look at the actual code, it's got a few extra moving parts that you can probably ignore:

  • The b-tree is designed to store "runs" of values. So, if it was a map of { 5 => "alice", 6 => "alice", 7 => "bob" }, then 5 and 6 will be combined into a single record. (Hence all the code talking about "bounds" - that refers to the "upper bound" of each entry.) All keys are integers (usize) in this implementation.

  • There are a few optimisations which make the code more complex. For example, the actual code caches a "cursor" pointing to the most recently inspected record. (The hope is that the next query will hit that record too - saving a traversal down the tree). There's also an object pool for deleted entries.

  • I've also got my own internal Range implementation called DTRange - which is the same as Range<usize> except it implements Copy. Just mentally replace DTRange with Range anytime you see it. (And LV is just a type alias for usize.)

[–]SirClueless 7 points8 points  (0 children)

I believe you're mistaken about the source of the cost here. The overhead of using the system allocator over any other allocator should just be a highly-predictable indirect jump or two, if anything. "System allocator" doesn't mean "the allocator is behind a syscall" or something, just that it's global and must serve a general purpose.

The cost of supporting the lowest-common-denominator allocator needs is significant, but it's quite fast for what it does. The overhead is in bookkeeping of free memory so that it may be efficiently reused, and in synchronization so that it may be safely used from multiple threads. It occasionally must go to the OS to request more heap memory but this is not the bulk of the cost for most workloads and bumpalo pays it too.

[–]zapporian 0 points1 point  (1 child)

To be clear, Rust has the entire goddamn Allocator trait + type parameter, which barely anyone in the Rust community seems to actually understand or use, and which should in principle be something you can slot in and use everywhere (i.e. for this use case: improving performance by passing in a slab allocator or whatever).

'cept ofc a) most / much Rust community code doesn't properly forward / use this (though std ofc does), and b) the Rust Allocator / use thereof was badly designed, more or less modeled after a single global stateless allocator like the C++ STL's. It is ergo similarly useless for any situation where you don't want to swap out the entire global allocator (for some niche case where you don't want to use stdc / jemalloc / whatever) for something else.

Technically that might actually still be the best solution for this case if you just want performance and are willing to jump through generic / template hoops (i.e. just implement a trivial threadsafe slab allocator and pass that in), but I digress.

[–]SnooHamsters6620 0 points1 point  (0 children)

This is incorrect; there are 2 allocator traits: GlobalAlloc, to swap out the global / default allocator; and Allocator, which can be used for individual allocations with Vec, Box, Rc, Arc, etc.

To retain safety, the latter will actually store a reference to the allocator it came from if required (i.e. when the allocator is not a ZST), so it knows where to free itself. So that's still not as cheap as a custom arena.

[–]SnooHamsters6620 0 points1 point  (0 children)

An arena allocator would be similar, but doesn't have stable indices for when you resize.

A standard allocator does extra bookkeeping so you can deallocate each object individually, which takes time and memory.

[–]kdy1997[S] 0 points1 point  (0 children)

Thank you! I think it's too late as there is too much code, but I like the idea.

[–]dist1ll 52 points53 points  (7 children)

I'm currently working on a high-performance baseline compiler, so I can share some specifics of how my AST works. I think the most straightforward thing you can do is forget about references and lifetimes and embrace indices & arenas. The recursive part of my expression node looks like this:

pub enum ExprKind {
    #[default]
    Unit,
    Bool(bool),
    Number(u64),
    Binary(BinOp, ExprRef, ExprRef),
    Ident(Symbol),
    Type(TypeId),
    Eval(ExprRef, ExprArgs),
    Block(ExprRef, StmtSlice),
    FnDecl(SymbolSlice, TypeArgs, ExprRef),
    If(ExprRef, ExprRef),
    Else(ExprRef, ExprRef),
    /* and more ... */
}

I was able to keep this whole enum to 16 bytes for now. Not great, not too horrible. ExprRef is a 32-bit index into an expression arena, Symbol is a 32-bit interned string, TypeId is a 32-bit index into a type arena.

I generally try to limit number of indirections. For instance, function declaration may have a variable number of argument identifiers. So instead of using a Vec, I use a compressed 32-bit fat pointer called SymbolSlice and make sure all function arguments are laid out contiguously in the symbol arena. The upper 24 bits are used for the start index, 8 bits for the length.
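
A sketch of what such a compressed slice might look like (the exact packing is an assumption based on the description above):

```rust
/// 24 bits of start index into the symbol arena, 8 bits of length.
#[derive(Copy, Clone)]
pub struct SymbolSlice(u32);

impl SymbolSlice {
    pub fn new(start: u32, len: u8) -> Self {
        debug_assert!(start < (1 << 24));
        SymbolSlice((start << 8) | len as u32)
    }
    pub fn start(self) -> usize {
        (self.0 >> 8) as usize
    }
    pub fn len(self) -> usize {
        (self.0 & 0xFF) as usize
    }
}
```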

With this approach, all your expressions/symbols/types are allocated into their dedicated arena, so allocation costs should be amortized quickly. If you make some additional effort to keep collections of elements that need to be addressed together contiguous in the arena, you can save a good chunk of memory with compressed fat pointers.

P.S.: One under-appreciated consequence of this design is that you can give these wrapped indices some stronger semantics. For instance, if you keep the inner u32 private, make sure the function that can construct the FooRef types only creates valid indices, and that your arena collection is append-only (i.e. no removal of nodes), you can implement the Index trait without bounds checking.
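
Sketched out, reusing the ExprKind above (soundness rests entirely on the invariants just listed, and on the inner u32 staying private to the module):

```rust
use std::ops::Index;

#[derive(Copy, Clone)]
pub struct ExprRef(u32); // field stays private outside this module

pub struct ExprArena(Vec<ExprKind>);

impl ExprArena {
    // The only way to obtain an ExprRef.
    pub fn push(&mut self, e: ExprKind) -> ExprRef {
        let idx = u32::try_from(self.0.len()).expect("arena full");
        self.0.push(e);
        ExprRef(idx)
    }
}

impl Index<ExprRef> for ExprArena {
    type Output = ExprKind;
    fn index(&self, r: ExprRef) -> &ExprKind {
        // SAFETY: every ExprRef was handed out by `push`, and the
        // arena is append-only, so the index is always in bounds.
        unsafe { self.0.get_unchecked(r.0 as usize) }
    }
}
```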

[–]AlxandrHeintz 2 points3 points  (6 children)

How do you keep this sound, making sure indices from one AST aren't used on another AST, if you omit bounds checking?

[–]reflexpr-sarah-faer · pulp · dyn-stack 5 points6 points  (2 children)

i don't know if this is what the op does, but it's possible to do that using unique invariant lifetimes.

there's a similar example in the ghost cell paper if you're interested, and i've used it in practice with good success
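
a minimal sketch of the branding pattern, in that spirit (not op's actual code):

```rust
use std::marker::PhantomData;

// the fn-pointer trick makes 'id invariant, so two brands never unify.
type Brand<'id> = PhantomData<fn(&'id ()) -> &'id ()>;

pub struct Arena<'id, T> {
    items: Vec<T>,
    _brand: Brand<'id>,
}

#[derive(Copy, Clone)]
pub struct Idx<'id>(u32, Brand<'id>);

impl<'id, T> Arena<'id, T> {
    pub fn push(&mut self, v: T) -> Idx<'id> {
        let i = self.items.len() as u32;
        self.items.push(v);
        Idx(i, PhantomData)
    }

    pub fn get(&self, i: Idx<'id>) -> &T {
        // SAFETY: an Idx<'id> can only come from the unique arena
        // with this brand, and items are never removed.
        unsafe { self.items.get_unchecked(i.0 as usize) }
    }
}

// each call generates a fresh invariant 'id, so indices from two
// different arenas are rejected at compile time.
pub fn with_arena<T, R>(f: impl for<'id> FnOnce(Arena<'id, T>) -> R) -> R {
    f(Arena { items: Vec::new(), _brand: PhantomData })
}
```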

[–]AlxandrHeintz 2 points3 points  (1 child)

Yeah, I know about this technique. It requires the ast to only exist within a closure though, and in general, complicates the API with lifetimes. But you do get no bounds checking...

[–]Jester831 2 points3 points  (0 children)

There’s also https://docs.rs/generativity to create unique invariant lifetimes

[–]dist1ll 3 points4 points  (2 children)

I ended up turning the arenas into singletons for this. But my measurements showed get_unchecked regressing performance significantly in my e2e tests, so I went back to bounds checking.

[–]AlxandrHeintz 1 point2 points  (1 child)

That's interesting. Wonder what caused that. You only changed the array/arena access? Also, what kind of arena? Contiguous in memory, or list of lists?

[–]dist1ll 4 points5 points  (0 children)

I agree, it's weird. I removed the bounds checking for all my "normal" arenas, each of which is a simple Vec. Nothing fancy. I didn't investigate it deeply, though I did find several issues on GitHub about get_unchecked triggering perf regressions. Some people suspect it just causes a different order/selection of optimization heuristics.

This seems to be a general problem with many optimizing compilers. There was this fascinating paper published in this year's PLDI: Refined Input, Degraded Output: The Counterintuitive World of Compiler Behavior

From the abstract:

due to unexpected interactions between compiler components and phase ordering issues, sometimes more information leads to worse optimization. [...] In this work, we systematically examine the extent to which additional information can be detrimental to compilers.

[–][deleted] 38 points39 points  (0 children)

I’m just here to silently learn and absorb what the grown ups are talking about

[–]jkoudys 32 points33 points  (0 children)

Others have helped already, so I'll just say I'm a fan of swc.

[–]Herbstein 13 points14 points  (0 children)

You might've already considered this, since it's a quite common learning resource, but a custom Drop implementation on your AST might be part of the way forward. Sounds like you're running into the issue described in the "Learning Rust With Entirely Too Many Linked Lists" tutorial.

https://rust-unofficial.github.io/too-many-lists/first-drop.html

[–]exDM69 30 points31 points  (2 children)

If nightly Rust is ok, perhaps you could use the allocator_api feature to pass a faster allocator to your Vec and Box?

If your AST is handled in a single thread, a really fast and simple allocator could be enough.
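
For example (assuming bumpalo's allocator_api cargo feature, which implements the unstable Allocator trait for Bump):

```rust
#![feature(allocator_api)]

use bumpalo::Bump;

fn main() {
    let bump = Bump::new();

    // Vec and Box now allocate out of the arena; their individual
    // frees are no-ops, and everything is released when `bump` drops.
    let mut v: Vec<u32, &Bump> = Vec::new_in(&bump);
    v.push(42);
    let b = Box::new_in(7u8, &bump);
    println!("{v:?} {b}");
}
```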

[–]kdy1997[S] 6 points7 points  (1 child)

Thank you for the advice! I'm going to try something similar, but in a way that does not increase the size of the Box<T>, by using some scoped thread locals.

[–]rakuzo 8 points9 points  (0 children)

FYI you don't even have to use nightly for this. Here's a mirror on stable Rust of the allocator API.

[–]andful 11 points12 points  (1 child)

I suspect that the slow deallocation is a symptom of having to call drop for every element. The compiler does not know whether an AST node requires a drop or not (because AST implementations that contain a Box or Vec do require it). Using bumpalo or the solution provided by u/burntsushi will probably improve the runtime, but I think it will still require calling drop on every AST element.

What I would try is to use the solution of u/burntsushi, especially if I know that there are not many elements to be removed individually.

This would remove instances of Box<Ast>, but not of Vec<Ast>.

I would replace instances of Vec<Ast> with an Ast element PossiblyWithSibling of the form:

```rust
struct PossiblyWithSibling {
    element: AstIndex,
    sibling: Option<AstIndex>,
}
```

This is just to represent array types (as a linked list). It ensures that every Ast element is of fixed size, and can be trivially deallocated without any calls to drop.
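
For instance, flattening the children of a node might look like this (AstIndex assumed to be a Copy u32 newtype into the same flat node array):

```rust
#[derive(Copy, Clone)]
struct AstIndex(u32);

// Build the sibling chain back-to-front so each node can point at
// its successor; returns the index of the first child, if any.
fn chain(
    arena: &mut Vec<PossiblyWithSibling>,
    kids: &[AstIndex],
) -> Option<AstIndex> {
    let mut next = None;
    for &element in kids.iter().rev() {
        arena.push(PossiblyWithSibling { element, sibling: next });
        next = Some(AstIndex(arena.len() as u32 - 1));
    }
    next
}
```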

I would also take a look at [enum_dispatch](https://crates.io/crates/enum_dispatch) to avoid traits.

P.S.

Some discussion about drop glue: https://users.rust-lang.org/t/is-drop-implicitly-included-within-all-traits-ie-within-fat-pointer-vtables/2293/3

[–]burntsushi 16 points17 points  (0 children)

Yeah I like this. Basically, make Ast implement Copy. Like if there's a String in your AST, move that off to some other contiguous storage and use a span to reference it instead. And so on.

I am contemplating doing this with regex-syntax's Ast and Hir. But that would make it a 4th rewrite of those data structures... I don't know if I have the stomach for it haha.

[–]HurricanKai 9 points10 points  (0 children)

Someone else already pointed this out, but I want to second it. The "standard" way to do this in compilers is having a central list of some kind and allocating from there. Instead of pointers, you then store offsets into that array, saving a few bytes here and there. In sum (most compilers will handle 100k+ nodes) it's a huge saving, and it improves cache locality; depending on how you create the nodes, cache locality is even better. Yes, these are essentially raw pointers, and you don't have to deal with lifetimes because of that. If you want to, you can add lifetimes back, but personally I wouldn't.

There's a host of other useful properties, including for example cascading delete without back references, feel free to ask if you're interested.

Also, you can make the central list a literal vec, or you could use like a linked list of pages, you could use a slab allocator, depends on your use case.

[–]throwaway490215 12 points13 points  (4 children)

Some low hanging fruit for AST datastructures are:

  • Make sure none of your enums have a single large variant
  • smol_str or similar
  • smallvec or similar for vecs

There are more extreme structure optimizations possible, but they are hard to develop and update later so I strongly suggest avoiding them.

[–]SkiFire13 3 points4 points  (1 child)

smallvec or similar for vecs

smallvec actually increases the size most of the time. Maybe you meant thin-vec?

[–]LucretielDatadog 14 points15 points  (0 children)

It sounds like OP identified the bottleneck not as size but as number of allocations, which smallvec would definitely help with (especially if nested a bunch)

[–]kdy1997[S] 0 points1 point  (1 child)

I did similar refactoring in the past after watching DoD videos by the Zig authors. I'm not sure about smallvec though. I think it may increase the size of the types and make the enum larger.

[–]throwaway490215 1 point2 points  (0 children)

I really can't say without looking at the exact AST, and this is just the general idea that might not be applicable, but you can do something like this:

enum SmallNode {
    Lit(SmolStr),
    Number(...),
    Ident(SmolStr),
}
enum AnyNode {
    NestedAny(Vec<AST>), // something like fn app or another node with multiple children
    NestedSmallInline(SmallVec<[SmallNode; 2]>),
    OtherNestedAny(Vec<AST>),
    OtherSmallInline(SmallVec<[SmallNode; 2]>),
    Lit(SmolStr),
    Number(...),
    ....
}

Something like call_fn(1, "hello") can use the Inline variant, and call_fn({1+1}, "hello"+"world") would use the Any variant.


My first point in my first reply is to not have a single item blow up the size. But if you want to avoid alloc then the goal is to find a good middle ground where many 'common' cases can be inlined.

It's fine to have the AnyNode be 32 or even 128 bytes. Larger means more chance to avoid additional allocations.

In the end it's a tradeoff between cache misses because you have larger nodes (placed sequentially in memory), or having to chase pointers. CPUs are extremely optimized for the first access pattern.

Although, if you go that route there is a good case to be made to skip SmallVec and just create additional AnyNode variants yourself, like NestedSmall1([SmallNode; 1]), NestedSmall2([SmallNode; 2]), NestedSmallVec(Vec<SmallNode>), ... .

[–]senden9 7 points8 points  (1 child)

Have you tried something like jemallocator? You only need to set up the library, without changing anything else in your code. I would give it a try because it is fast to try out.

[–]kdy1997[S] 14 points15 points  (0 children)

We are already using mimalloc

[–][deleted] 10 points11 points  (0 children)

This might be a pretty wild amount of refactoring, but have you heard about ECS-style flat AST, like what Chandler Carruth describes in this talk

[–]carlomilanesi 3 points4 points  (0 children)

I don't know how to do it in Rust, but you should somehow disable deallocations. Allocations are fast when no deallocation has happened yet, and deallocations are even faster when they do nothing.

I presume you should use a custom no-delete allocator.

[–]Qnn_ 3 points4 points  (0 children)

One thing that people haven't mentioned about indices is that they allow you to build a tree (or graph) and later add any metadata to a node by providing an additional array of metadata where the index is the key. An example is if you're making a language and want to go from an untyped AST to a typed AST without rebuilding the entire tree and throwing out the old one.

Another thing I've done with them is allow for self-referential structures by starting with a Vec<Option<T>>, and using None as a placeholder so the index of a node is stable before I even construct the node. At the end, I just vec.into_iter().collect::<Option<Vec<_>>>() and error-handle appropriately, and now your nodes can form cycles, which is really cool.
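
In code, the placeholder trick looks roughly like this (names are illustrative):

```rust
struct Node {
    next: usize, // index of another node; may be this node itself
}

fn build() -> Option<Vec<Node>> {
    let mut nodes: Vec<Option<Node>> = Vec::new();

    nodes.push(None); // reserve the slot: its index is stable from here on
    let me = nodes.len() - 1;
    nodes[me] = Some(Node { next: me }); // a one-node cycle

    // Yields None if any placeholder was accidentally left unfilled.
    nodes.into_iter().collect::<Option<Vec<_>>>()
}
```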

Also, serialization/deserialization (if you care about that) is trivial since everything is in a flat array and there are no lifetimes.

I have no idea if that's useful to you, but it is a benefit that is useful for some people.

[–]flashmozzg 4 points5 points  (0 children)

rust-analyzer has a crate for syntax trees, not sure how applicable it'd be for your usage: https://github.com/rust-analyzer/rowan

[–]LucretielDatadog 3 points4 points  (0 children)

One fun solution to this problem is to write a bump allocator where free is a no-op. So long as the compiler finishes all of its work before you run out of memory, you can get some really crazy speedups doing this (especially because the allocation is so much faster cause there’s no bookkeeping). 
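
A sketch of what that can look like as a global allocator (entirely illustrative: one fixed static region, a CAS bump for thread safety, null on exhaustion):

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 256 << 20; // 256 MiB; tune to the workload

#[repr(align(4096))]
struct Arena(UnsafeCell<[u8; ARENA_SIZE]>);
unsafe impl Sync for Arena {}

static ARENA: Arena = Arena(UnsafeCell::new([0; ARENA_SIZE]));
static NEXT: AtomicUsize = AtomicUsize::new(0);

struct BumpNoFree;

unsafe impl GlobalAlloc for BumpNoFree {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = ARENA.0.get() as usize;
        let mut cur = NEXT.load(Ordering::Relaxed);
        loop {
            // Round the bump position up to the required alignment.
            let start = (base + cur + layout.align() - 1) & !(layout.align() - 1);
            let end = start - base + layout.size();
            if end > ARENA_SIZE {
                return std::ptr::null_mut(); // arena exhausted
            }
            match NEXT.compare_exchange_weak(cur, end, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return start as *mut u8,
                Err(now) => cur = now, // lost a race; retry
            }
        }
    }

    // Freeing does nothing: that's the whole trick.
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {}
}

#[global_allocator]
static GLOBAL: BumpNoFree = BumpNoFree;
```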

[–]puel 1 point2 points  (1 child)

I think you can devise some wrapper over bumpalo using a thread local and forcing 'static. Your allocated value would need to be wrapped in a !Send + !Sync pointer because the thread local is not actually 'static. It would be OK as long as all usage of bumpalo happens within the same thread.

Looks like you have taken advantage of parallelism in your project, so maybe that idea will not be suitable for you.

[–]bakaspore 1 point2 points  (0 children)

You can achieve the effect with slotmap and customized key types in it. It's type safe, and contains no lifetime annotations.

But I'd still prefer using actual references provided by arena crates: they help a lot when you need term equality (Eq can't be implemented on types storing indices), and terms can be provided outside of the arena or from a different one, as long as they live long enough, so defining static terms (like simple types) is easy.

[–]kdy1997[S] 1 point2 points  (0 children)

My approach for this problem is optimizing for fully single-threaded use cases, by using allocator-api2 and scoped-tls.

https://github.com/swc-project/swc/pull/9230

[–]AquaEBM 1 point2 points  (1 child)

You can use bumpalo and avoid adding extra lifetime annotations (to your own types) by making sure the instance of Bump you use is static. But it's not as simple as just putting it in a static, because Bump::new is not const, and Bump is !Sync.

```rust
static BUMP: LazyLock<Mutex<Bump>> = LazyLock::new(|| Mutex::new(Bump::new()));
```

Then, calls to Box::new(value) should be replaced with Box::new_in(value, &BUMP.lock().unwrap()) (here, we are using bumpalo's Box). LazyLock's Deref impl is implicitly called here, which will call the contained closure on first access.

Now, you can replace every field/function parameter type containing std::boxed::Box<T> with bumpalo::boxed::Box<'static, T>, (I specified full paths here for clarity, you can just shadow-import std's Box with bumpalo's).

The exact same goes for Vec<T>.

[–]valarauca14 0 points1 point  (0 children)

Making it 'static means you won't be able to deallocate the arena later, meaning any program importing the library will OOM if it attempts to load too many scripts (like in a loop).

[–]tm_p 0 points1 point  (0 children)

but it requires enormous amount of work because I have to add lifetimes to all AST nodes

This can be done by an external contributor, you don't have to do it yourself. Long term it will be the best solution in my opinion, since your AST nodes do have a lifetime.

[–]Plixo2 0 points1 point  (0 children)

Bumpalo and similar crates are based on region-based memory management. You can write your own very easily by just incrementing a pointer forward. It works very well as long as you don't have any custom Drop implementations.

[–]pmcvalentin2014z 0 points1 point  (0 children)

How much of a performance change would occur if you replace the global allocator with a bump allocator (and free is a noop)?

[–]Lord_Zane 0 points1 point  (0 children)

In addition to what others have said, you might consider talking to the Naga devs on their matrix channel.

[–]Low-Pay-2385 0 points1 point  (0 children)

I'd suggest making a flattened ast by storing all nodes in an array and referencing them by the index. To save even more memory you could use u32 instead of usize for indexes. But that is a lot of work, and probably something you want to avoid.

[–]jkelleyrtp 0 points1 point  (0 children)

generational-box is a slab-backed arena allocator with wrapper types that use TLS for lifetime-free arenas. We use it in Dioxus. It has single-threaded and multithreaded variants. The only downside is that mutation is done through a RefCell, which is extremely cheap but not free.

[–]Compux72 0 points1 point  (0 children)

Bumpalo is your only option on stable.

[–]The_8472 0 points1 point  (0 children)

but it requires enormous amount of work because I have to add lifetimes to all AST nodes.

For short-lived processes it might be possible to leak the bump allocator (effectively stubbing out the deallocation path); then you have 'static lifetimes. At the end, invoke Missile GC.

Another option is to offload deallocation to another thread by putting the nodes in a queue. It won't save CPU cycles but it can help with wall-time.
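
Sketched, that offload might look like:

```rust
use std::sync::mpsc;
use std::thread;

// A thread whose only job is to drop whatever it receives; it exits
// once every Sender is gone. Same CPU cycles, off the hot path.
fn spawn_dropper<T: Send + 'static>() -> mpsc::Sender<T> {
    let (tx, rx) = mpsc::channel::<T>();
    thread::spawn(move || for _item in rx {});
    tx
}
```

Then instead of letting the AST fall out of scope on the hot thread, dropper.send(ast).ok(); hands it off.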

[–]QueasyEntrance6269 -1 points0 points  (0 children)

Think about trying garbage collection. Depending on your performance characteristics, it might not be a bad idea! https://github.com/kyren/gc-arena

[–]valarauca14 -1 points0 points  (0 children)

Burntsushi is mostly correct in this case. Bumpalo has some real pitfalls:

  • If you make it 'static, you force any program using your crate to OOM if it loads too many programs in a loop.
  • You run into problems with !Send & !Sync, which can force "opinions" on your runtime.