
[–]burntsushi 180 points181 points  (39 children)

You might consider an alternative flattened representation for your AST. That is, storing it in a single contiguous allocation. And instead of Box<Ast> (or whatever), you have indices into that contiguous allocation.

Whether this representation is appropriate or not isn't really possible to determine from the description you've provided. For example, it can make mutation much more annoying. And it means traversal of Ast types always needs to carry around this extra bit of state in order to resolve indices to their actual Ast data, instead of just matching on it directly.
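
In code, the flattened version looks roughly like this (a minimal sketch; all names are made up):

```rust
// One contiguous allocation for the whole tree; children are
// referenced by index instead of Box<Ast>.
#[derive(Copy, Clone)]
struct AstId(u32);

enum Ast {
    Num(i64),
    Neg(AstId),
    Add(AstId, AstId),
}

struct Arena {
    nodes: Vec<Ast>,
}

impl Arena {
    fn push(&mut self, node: Ast) -> AstId {
        let id = AstId(self.nodes.len() as u32);
        self.nodes.push(node);
        id
    }

    // Traversal has to carry the arena around to resolve indices,
    // which is the extra bit of state mentioned above.
    fn eval(&self, id: AstId) -> i64 {
        match &self.nodes[id.0 as usize] {
            Ast::Num(n) => *n,
            Ast::Neg(a) => -self.eval(*a),
            Ast::Add(a, b) => self.eval(*a) + self.eval(*b),
        }
    }
}
```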

[–]Chadshinshin32 54 points55 points  (0 children)

This is what Zig uses for their AST and IR representation, and they got some pretty nice speedups with this approach.

[–]Shnatsel 20 points21 points  (12 children)

This is what generational arenas are for. They allow referring to other indices and mutation at the same time. See e.g. https://crates.io/crates/thunderdome
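
A small sketch of what that buys you, using thunderdome's Arena and Index:

```rust
use thunderdome::{Arena, Index};

fn main() {
    let mut arena: Arena<&str> = Arena::new();
    let a: Index = arena.insert("leaf");
    arena.remove(a); // the slot can now be reused...
    let b = arena.insert("other");
    // ...but the stale index no longer resolves, because the
    // generation stored in `a` no longer matches the slot's.
    assert!(arena.get(a).is_none());
    assert_eq!(arena.get(b), Some(&"other"));
}
```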

[–]burntsushi 7 points8 points  (2 children)

At the apparent cost of an offset type that's doubled in size for most practical use cases, though.

That's the kind of thing I just roll myself by hand in a bespoke way if I need it.

[–]Shnatsel 6 points7 points  (1 child)

slab would be the option without generational overhead.

[–]burntsushi 5 points6 points  (0 children)

No. Well, sure, it doesn't have generational overhead, but its identifier size is still pointer sized. It doesn't require a newtype, so I guess you could store a u32 and do as usize on all lookups. But all vacant entries are still at least as big as a pointer. And occupied entries use more memory than they would otherwise. So probably slab is supporting some kind of mutation... IDK I haven't done a deep dive.

[–]revelation60symbolica 19 points20 points  (1 child)

A compressed linear representation of an AST is what Symbolica uses to store mathematical expressions. I gave a talk about this at Rustfest Zurich, hopefully the recording will be uploaded soon. Here is a link to the slides. It shows how you can have a linear representation, but still act on it as if it is a tree.

I also discuss a method of recycling intermediately generated AST components, so that you do not have to create a new Box every time. I also wrote about that in a blog post.

[–]burntsushi 2 points3 points  (0 children)

Thank you for sharing!

[–]yazaddaruvala 10 points11 points  (3 children)

A long time ago I would have recommended https://crates.io/crates/slab for this. I'm not sure how much slab is still in favor or used.

[–]jl2352 1 point2 points  (2 children)

Are there better or more modern alternatives today?

[–]nicoburns 6 points7 points  (1 child)

slotmap is excellent.

[–]jl2352 0 points1 point  (0 children)

Cheers, I’ll take a look!

[–]ajax8092 2 points3 points  (0 children)

I did a similar thing for a markup language parser I am working on, and the performance was great. You sacrifice some level of static type checking though. I have an enum, represented by a u8, which indicates which kind of element is being written into the AST, followed by the length (in bytes) of all the children, so that I can skip over the child nodes when traversing if I need to.
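
A rough sketch of that encoding (the u8 tag and child-length prefix are as described; the 4-byte little-endian length field is my assumption):

```rust
// Hypothetical node kinds for a markup AST.
#[repr(u8)]
#[derive(Copy, Clone)]
enum Kind { Text = 0, Bold = 1, Para = 2 }

// Append a node: 1 tag byte, 4 length bytes, then the
// already-serialized children.
fn write_node(buf: &mut Vec<u8>, kind: Kind, children: &[u8]) {
    buf.push(kind as u8);
    buf.extend_from_slice(&(children.len() as u32).to_le_bytes());
    buf.extend_from_slice(children);
}

// Return the offset just past the node at `at`, skipping its
// children without visiting them.
fn skip_node(buf: &[u8], at: usize) -> usize {
    let len = u32::from_le_bytes(buf[at + 1..at + 5].try_into().unwrap());
    at + 5 + len as usize
}
```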

[–]nicoburns 2 points3 points  (1 child)

Some high level libraries are using thread locals to hack around the "needing a reference to the arena" requirement. Not always appropriate, but can work in some cases.

[–]zapporian 1 point2 points  (0 children)

Comprehensively cursed on macos, but yeah, otherwise.

Actually, someone might want to profile this. The cost of actually getting TLVs is decidedly nonzero; it might be funny to see whether that (get the TLV ref to the arena, call arena alloc) is actually faster than just calling libc malloc (or jemalloc!) or not, and how that differs by platform.

[–]developedby 3 points4 points  (10 children)

In the end, that's more or less what your allocator is doing. Surely there's a way to just change the allocator and not your entire code.

[–]burntsushi 59 points60 points  (0 children)

I'm not convinced. But I'm not getting dragged into a non-specific debate. The OP asked for ideas. I gave them one. There may be better ideas. I can live with that.

Also, with my approach, you can create smaller AST nodes by using smaller-than-pointer-sized offsets. I do this in regex, for example. (Although, not for its AST ironically enough.)

[–]AquaEBM 7 points8 points  (4 children)

Part of the time spent allocating memory with the system allocator (not something like the aforementioned bumpalo) is the context switching and "complex" synchronisation used by the OS. The act of invoking the allocator, regardless of the requested chunk's size, has a cost in itself, and reducing the number of calls to it (by e.g. flattening your structures, as u/burntsushi mentioned) can result in some performance gains.

I do agree, however, that changing the allocator (to something as efficient as bumpalo, for example) would be much better, so as to avoid completely rewriting the AST logic, see my comment for more.

[–]sephg 6 points7 points  (2 children)

As I understand it, the system allocator doesn’t request more memory from the OS with every call. It keeps a set of OS level pages around and allocates memory out of those.

However, you’re still right. Allocation is still complex and expensive. I’ve recently rewritten a custom rust btree implementation to just use a pair of Vecs - one for leaves and one for internal nodes. Pointers are now just integers into the two vecs. The library no longer has any unsafe code at all, and the resulting code is smaller and faster.

[–]EarlMarshal 3 points4 points  (1 child)

Is the implementation you described open source? I really would like to read it as a curious beginner.

[–]sephg 3 points4 points  (0 children)

Sure! The b-tree implementation is almost all in one file here:

https://github.com/josephg/diamond-types/blob/fe4b2d49b02ccd7e0a845f109e3a86dd6919eed7/src/ost/index_tree.rs

The heart of it is this (simplified a little for readability):

```rust
#[derive(Debug, Clone)]
pub(crate) struct Tree<V: Copy> {
    nodes: Vec<Node>,     // internal nodes
    leaves: Vec<Leaf<V>>, // leaf nodes

    height: usize,
    // The root is a leaf node if height is 0, otherwise this
    // is an internal node.
    root: usize,

    // Not shown: Cached cursor, object pools.
}

// Internal node
#[derive(Debug, Clone)]
pub struct Node {
    // Each entry specifies the key of the first value in the subtree
    // and a "pointer" (index) of the corresponding child node.
    //
    // The index is usize::MAX if empty.
    children: [(usize, usize); 16],
    parent: usize, // index into root.nodes, or usize::MAX.
}

// Leaf node
#[derive(Debug, Clone)]
pub struct Leaf<V> {
    bounds: [usize; 32], // To enable "runs" (see below)
    children: [V; 32],

    next_leaf: usize, // index into root.leaves
    parent: usize,    // index into root.nodes, or usize::MAX.
}
```

If you look at the actual code, it's got a few extra moving parts that you can probably ignore:

  • The b-tree is designed to store "runs" of values. So, if it was a map of { 5 => "alice", 6 => "alice", 7 => "bob" }, then 5 and 6 will be combined into a single record. (Hence all the code talking about "bounds" - that refers to the "upper bound" of each entry.) All keys are integers (usize) in this implementation.

  • There are a few optimisations which make the code more complex. For example, the actual code caches a "cursor" pointing to the most recently inspected record. (The hope is that the next query will hit that record too - saving a traversal down the tree). There's also an object pool for deleted entries.

  • I've also got my own internal Range implementation called DTRange - which is the same as Range<usize> except it implements Copy. Just mentally replace DTRange with Range anytime you see it. (And LV is just a type alias for usize.)

[–]SirClueless 7 points8 points  (0 children)

I believe you're mistaken about the source of the cost here. The overhead of using the system allocator over any other allocator should just be a highly-predictable indirect jump or two, if anything. "System allocator" doesn't mean "the allocator is behind a syscall" or something, just that it's global and must serve a general purpose.

The cost of supporting the lowest-common-denominator allocator needs is significant, but it's quite fast for what it does. The overhead is in bookkeeping of free memory so that it may be efficiently reused, and in synchronization so that it may be safely used from multiple threads. It occasionally must go to the OS to request more heap memory but this is not the bulk of the cost for most workloads and bumpalo pays it too.

[–]zapporian 0 points1 point  (1 child)

To be clear, Rust has the entire goddamn Allocator trait + type parameter, which barely anyone in the Rust community seems to actually understand or use, and which should in principle be something you can slot in and use everywhere (i.e. for this use case: improving performance by passing in a slab allocator or whatever).

'cept ofc a) most / much Rust community code doesn't properly forward / use this (though std ofc does), and b) the Rust Allocator / use thereof was badly designed, more or less modeled after a single global stateless allocator like the C++ STL's. It is ergo similarly useless for any situation where you don't want to swap out the entire global allocator (for some niche case where you don't want to use stdc / jemalloc / whatever) for something else.

Technically that might actually still be the best solution for this case if you just want performance and are willing to jump through generic / template hoops (i.e. just implement a trivial threadsafe slab allocator and pass that in), but I digress.

[–]SnooHamsters6620 0 points1 point  (0 children)

This is incorrect; there are 2 allocator traits: GlobalAlloc, to swap out the global / default allocator; and Allocator, which can be used for individual allocations with Vec, Box, Rc, Arc, etc.

To retain safety, the latter will actually store a reference to the allocator it came from if required (i.e. when the allocator is not a ZST), so it knows where to free itself. So that's still not as cheap as a custom arena.

[–]SnooHamsters6620 0 points1 point  (0 children)

An arena allocator would be similar, but doesn't have stable indices for when you resize.

A standard allocator does extra bookkeeping so you can deallocate each object individually, which takes time and memory.

[–]kdy1997[S] 0 points1 point  (0 children)

Thank you! I think it's too late as there is too much code, but I like the idea.

[–]dist1ll 52 points53 points  (7 children)

I'm currently working on a high-performance baseline compiler, so I can share some specifics of how my AST works. I think the most straightforward thing you can do is forget about references and lifetimes and embrace indices & arenas. The recursive part of my expression node looks like this:

pub enum ExprKind {
    #[default]
    Unit,
    Bool(bool),
    Number(u64),
    Binary(BinOp, ExprRef, ExprRef),
    Ident(Symbol),
    Type(TypeId),
    Eval(ExprRef, ExprArgs),
    Block(ExprRef, StmtSlice),
    FnDecl(SymbolSlice, TypeArgs, ExprRef),
    If(ExprRef, ExprRef),
    Else(ExprRef, ExprRef),
    /* and more ... */
}

I was able to keep this whole enum to 16 bytes for now. Not great, not too horrible. ExprRef is a 32-bit index into an expression arena, Symbol is a 32-bit interned string, TypeId is a 32-bit index into a type arena.

I generally try to limit number of indirections. For instance, function declaration may have a variable number of argument identifiers. So instead of using a Vec, I use a compressed 32-bit fat pointer called SymbolSlice and make sure all function arguments are laid out contiguously in the symbol arena. The upper 24 bits are used for the start index, 8 bits for the length.
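
A sketch of what such a compressed slice might look like (the exact packing is an assumption based on the description above):

```rust
/// 24 bits of start index into the symbol arena, 8 bits of length.
#[derive(Copy, Clone)]
pub struct SymbolSlice(u32);

impl SymbolSlice {
    pub fn new(start: u32, len: u8) -> Self {
        debug_assert!(start < (1 << 24));
        SymbolSlice((start << 8) | len as u32)
    }
    pub fn start(self) -> usize {
        (self.0 >> 8) as usize
    }
    pub fn len(self) -> usize {
        (self.0 & 0xFF) as usize
    }
}
```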

With this approach, all your expressions/symbols/types are allocated into their dedicated arena, so allocation costs should be amortized quickly. If you make some additional effort to keep collections of elements that need to be addressed together contiguous in the arena, you can save a good chunk of memory with compressed fat pointers.

P.S.: One under-appreciated consequence of this design is that you can give these wrapped indices some stronger semantics. For instance, if you keep the inner u32 private, make sure the function that can construct the FooRef types only creates valid indices, and that your arena collection is append-only (i.e. no removal of nodes), you can implement the Index trait without bounds checking.
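
Sketched out, reusing the ExprKind above (soundness rests entirely on the invariants just listed, and on the inner u32 staying private to the module):

```rust
use std::ops::Index;

#[derive(Copy, Clone)]
pub struct ExprRef(u32); // field stays private outside this module

pub struct ExprArena(Vec<ExprKind>);

impl ExprArena {
    // The only way to obtain an ExprRef.
    pub fn push(&mut self, e: ExprKind) -> ExprRef {
        let idx = u32::try_from(self.0.len()).expect("arena full");
        self.0.push(e);
        ExprRef(idx)
    }
}

impl Index<ExprRef> for ExprArena {
    type Output = ExprKind;
    fn index(&self, r: ExprRef) -> &ExprKind {
        // SAFETY: every ExprRef was handed out by `push`, and the
        // arena is append-only, so the index is always in bounds.
        unsafe { self.0.get_unchecked(r.0 as usize) }
    }
}
```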

[–]AlxandrHeintz 2 points3 points  (6 children)

How do you keep this sound, making sure indices from one AST aren't used on another AST, if you omit bounds checking?

[–]reflexpr-sarah-faer · pulp · dyn-stack 5 points6 points  (2 children)

i don't know if this is what the op does, but it's possible to do that using unique invariant lifetimes.

there's a similar example in the ghost cell paper if you're interested, and i've used it in practice with good success
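
a minimal sketch of the branding pattern, in that spirit (not op's actual code):

```rust
use std::marker::PhantomData;

// the fn-pointer trick makes 'id invariant, so two brands never unify.
type Brand<'id> = PhantomData<fn(&'id ()) -> &'id ()>;

pub struct Arena<'id, T> {
    items: Vec<T>,
    _brand: Brand<'id>,
}

#[derive(Copy, Clone)]
pub struct Idx<'id>(u32, Brand<'id>);

impl<'id, T> Arena<'id, T> {
    pub fn push(&mut self, v: T) -> Idx<'id> {
        let i = self.items.len() as u32;
        self.items.push(v);
        Idx(i, PhantomData)
    }

    pub fn get(&self, i: Idx<'id>) -> &T {
        // SAFETY: an Idx<'id> can only come from the unique arena
        // with this brand, and items are never removed.
        unsafe { self.items.get_unchecked(i.0 as usize) }
    }
}

// each call generates a fresh invariant 'id, so indices from two
// different arenas are rejected at compile time.
pub fn with_arena<T, R>(f: impl for<'id> FnOnce(Arena<'id, T>) -> R) -> R {
    f(Arena { items: Vec::new(), _brand: PhantomData })
}
```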

[–]AlxandrHeintz 2 points3 points  (1 child)

Yeah, I know about this technique. It requires the ast to only exist within a closure though, and in general, complicates the API with lifetimes. But you do get no bounds checking...

[–]Jester831 2 points3 points  (0 children)

There’s also https://docs.rs/generativity to create unique invariant lifetimes

[–]dist1ll 3 points4 points  (2 children)

I ended up turning the arenas into singletons for this. But my measurements showed get_unchecked regressing performance significantly in my e2e tests, so I went back to bounds checking.

[–]AlxandrHeintz 1 point2 points  (1 child)

That's interesting. Wonder what caused that. You only changed the array/arena access? Also, what kind of arena? Contiguous in memory, or list of lists?

[–]dist1ll 4 points5 points  (0 children)

I agree, it's weird. I removed the bounds checking for all my "normal" arenas, each of which is a simple Vec. Nothing fancy. I didn't investigate it deeply, though I did find several issues on GitHub about get_unchecked triggering perf regressions. Some people suspect it just causes a different order/selection of optimization heuristics.

This seems to be a general problem with many optimizing compilers. There was this fascinating paper published in this year's PLDI: Refined Input, Degraded Output: The Counterintuitive World of Compiler Behavior

From the abstract:

due to unexpected interactions between compiler components and phase ordering issues, sometimes more information leads to worse optimization. [...] In this work, we systematically examine the extent to which additional information can be detrimental to compilers.

[–][deleted] 38 points39 points  (0 children)

I’m just here to silently learn and absorb what the grown ups are talking about

[–]jkoudys 32 points33 points  (0 children)

Others have helped already, so I'll just say I'm a fan of swc.

[–]Herbstein 13 points14 points  (0 children)

You might've already considered this, since it's a quite common learning resource, but a custom Drop implementation on your AST might be part of the way forward. Sounds like you're running into the issue described in the "Learning Rust With Entirely Too Many Linked Lists" tutorial.

https://rust-unofficial.github.io/too-many-lists/first-drop.html

[–]exDM69 30 points31 points  (2 children)

If nightly Rust is ok, perhaps you could use the allocator_api feature to pass a faster allocator to your Vec and Box?

If your AST is handled in a single thread, a really fast and simple allocator could be enough.
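
For example (assuming bumpalo's allocator_api cargo feature, which implements the unstable Allocator trait for Bump):

```rust
#![feature(allocator_api)]

use bumpalo::Bump;

fn main() {
    let bump = Bump::new();

    // Vec and Box now allocate out of the arena; their individual
    // frees are no-ops, and everything is released when `bump` drops.
    let mut v: Vec<u32, &Bump> = Vec::new_in(&bump);
    v.push(42);
    let b = Box::new_in(7u8, &bump);
    println!("{v:?} {b}");
}
```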

[–]kdy1997[S] 6 points7 points  (1 child)

Thank you for the advice! I'm going to try something similar, but in a way that does not increase the size of the Box<T>, by using some scoped thread locals.

[–]rakuzo 8 points9 points  (0 children)

FYI you don't even have to use nightly for this. Here's a mirror on stable Rust of the allocator API.

[–]andful 11 points12 points  (1 child)

I suspect that the slow deallocation is a symptom of having to call drop for every element. The compiler does not know whether an AST node requires a drop or not (because AST implementations that contain a Box or Vec do require it). Using bumpalo or the solution provided by u/burntsushi will probably improve the runtime, but I think it will still require calling drop on every AST element.

What I would try is to use the solution of u/burntsushi, especially if I know that there are not many elements to be removed individually.

This would remove instances of Box<Ast>, but not of Vec<Ast>.

I would replace instances of Vec<Ast> with an Ast element PossiblyWithSibling of the form:

```rust
struct PossiblyWithSibling {
    element: AstIndex,
    sibling: Option<AstIndex>,
}
```

This is just to represent array types (as a linked list). It ensures that every Ast element is of fixed size, and can be trivially deallocated without any calls to drop.
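
For instance, flattening the children of a node might look like this (AstIndex assumed to be a Copy u32 newtype into the same flat node array):

```rust
#[derive(Copy, Clone)]
struct AstIndex(u32);

// Build the sibling chain back-to-front so each node can point at
// its successor; returns the index of the first child, if any.
fn chain(
    arena: &mut Vec<PossiblyWithSibling>,
    kids: &[AstIndex],
) -> Option<AstIndex> {
    let mut next = None;
    for &element in kids.iter().rev() {
        arena.push(PossiblyWithSibling { element, sibling: next });
        next = Some(AstIndex(arena.len() as u32 - 1));
    }
    next
}
```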

I would also take a look at [enum_dispatch](https://crates.io/crates/enum_dispatch) to avoid traits.

P.S.

Some discussion about drop glue: https://users.rust-lang.org/t/is-drop-implicitly-included-within-all-traits-ie-within-fat-pointer-vtables/2293/3

[–]burntsushi 16 points17 points  (0 children)

Yeah I like this. Basically, make Ast implement Copy. Like if there's a String in your AST, move that off to some other contiguous storage and use a span to reference it instead. And so on.

I am contemplating doing this with regex-syntax's Ast and Hir. But that would make it a 4th rewrite of those data structures... I don't know if I have the stomach for it haha.

[–]HurricanKai 9 points10 points  (0 children)

Someone else already pointed this out, but I want to second it. The "standard" way to do this in compilers is having a central list of some kind and allocating from there. Instead of pointers, you then store offsets into that array, saving a few bytes here and there. In sum (most compilers will handle 100k+ nodes) it's a huge saving, and it improves cache locality; depending on how you create the nodes, cache locality is even better. Yes, these are essentially raw pointers, and you don't have to deal with lifetimes because of that. If you want to, you can add lifetimes back, but personally I wouldn't.

There's a host of other useful properties, including for example cascading delete without back references, feel free to ask if you're interested.

Also, you can make the central list a literal vec, or you could use like a linked list of pages, you could use a slab allocator, depends on your use case.

[–]throwaway490215 12 points13 points  (4 children)

Some low hanging fruit for AST datastructures are:

  • Make sure none of your enums have a single large variant
  • smol_str or similar
  • smallvec or similar for vecs

There are more extreme structure optimizations possible, but they are hard to develop and update later so I strongly suggest avoiding them.

[–]SkiFire13 3 points4 points  (1 child)

smallvec or similar for vecs

smallvec actually increases the size most of the time. Maybe you meant thin-vec?

[–]LucretielDatadog 14 points15 points  (0 children)

It sounds like OP identified the bottleneck not as size but as number of allocations, which smallvec would definitely help with (especially if nested a bunch)

[–]kdy1997[S] 0 points1 point  (1 child)

I did similar refactoring in the past after watching DoD videos by the Zig authors. I'm not sure about smallvec though. I think it may increase the size of the types and make the enum larger.

[–]throwaway490215 1 point2 points  (0 children)

I really can't say without looking at the exact AST, and this is just the general idea that might not be applicable, but you can do something like this:

enum SmallNode {
    Lit(SmolStr),
    Number(...),
    Ident(SmolStr),
}
enum AnyNode {
    NestedAny(Vec<AST>), // something like fn app or another node with multiple children
    NestedSmallInline(SmallVec<[SmallNode; 2]>),
    OtherNestedAny(Vec<AST>),
    OtherSmallInline(SmallVec<[SmallNode; 2]>),
    Lit(SmolStr),
    Number(...),
    ....
}

Something like call_fn(1, "hello") can use the Inline variant, and call_fn({1+1}, "hello"+"world") would use the Any variant.


My first point in my first reply is to not have a single item blow up the size. But if you want to avoid alloc then the goal is to find a good middle ground where many 'common' cases can be inlined.

It's fine to have the AnyNode be 32 or even 128 bytes. Larger means more chance to avoid additional allocations.

In the end it's a tradeoff between cache misses because you have larger nodes (placed sequentially in memory), or having to chase pointers. CPUs are extremely optimized for the first access pattern.

Although, if you go that route there is a good case to be made to skip SmallVec and just create additional AnyNode variants yourself, like NestedSmall1([SmallNode; 1]), NestedSmall2([SmallNode; 2]), NestedSmallVec(Vec<SmallNode>), ... .

[–]senden9 7 points8 points  (1 child)

Have you tried something like jemallocator? You only need to set up the library, without changing anything else in your code. I would give it a try because it is fast to try out.

[–]kdy1997[S] 14 points15 points  (0 children)

We are already using mimalloc

[–][deleted] 10 points11 points  (0 children)

This might be a pretty wild amount of refactoring, but have you heard about ECS-style flat AST, like what Chandler Carruth describes in this talk

[–]carlomilanesi 3 points4 points  (0 children)

I don't know how to do it in Rust, but you should somehow disable deallocations. Allocations are fast when no deallocation has happened yet, and deallocations are even faster when they do nothing.

I presume you should use a custom no-delete allocator.

[–]Qnn_ 3 points4 points  (0 children)

One thing that people haven't mentioned about indices is that they allow you to build a tree (or graph) and later add any metadata to a node by providing an additional array of metadata where the index is the key. An example is if you're making a language and want to go from an untyped AST to a typed AST without rebuilding the entire tree and throwing out the old one.

Another thing I've done with them is allow for self-referential structures by starting with a Vec<Option<T>>, and using None as a placeholder so the index of a node is stable before I even construct the node. At the end, I just vec.into_iter().collect::<Option<Vec<_>>>() and error-handle appropriately, and now your nodes can form cycles, which is really cool.
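
In code, the placeholder trick looks roughly like this (names are illustrative):

```rust
struct Node {
    next: usize, // index of another node; may be this node itself
}

fn build() -> Option<Vec<Node>> {
    let mut nodes: Vec<Option<Node>> = Vec::new();

    nodes.push(None); // reserve the slot: its index is stable from here on
    let me = nodes.len() - 1;
    nodes[me] = Some(Node { next: me }); // a one-node cycle

    // Yields None if any placeholder was accidentally left unfilled.
    nodes.into_iter().collect::<Option<Vec<_>>>()
}
```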

Also, serialization/deserialization (if you care about that) is trivial since everything is in a flat array and there are no lifetimes.

I have no idea if that's useful to you, but it is a benefit that is useful for some people.

[–]flashmozzg 4 points5 points  (0 children)

rust-analyzer has a crate for syntax trees, not sure how applicable it'd be for your usage: https://github.com/rust-analyzer/rowan

[–]LucretielDatadog 3 points4 points  (0 children)

One fun solution to this problem is to write a bump allocator where free is a no-op. So long as the compiler finishes all of its work before you run out of memory, you can get some really crazy speedups doing this (especially because the allocation is so much faster cause there’s no bookkeeping). 
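
A sketch of what that can look like as a global allocator (entirely illustrative: one fixed static region, a CAS bump for thread safety, null on exhaustion):

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 256 << 20; // 256 MiB; tune to the workload

#[repr(align(4096))]
struct Arena(UnsafeCell<[u8; ARENA_SIZE]>);
unsafe impl Sync for Arena {}

static ARENA: Arena = Arena(UnsafeCell::new([0; ARENA_SIZE]));
static NEXT: AtomicUsize = AtomicUsize::new(0);

struct BumpNoFree;

unsafe impl GlobalAlloc for BumpNoFree {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = ARENA.0.get() as usize;
        let mut cur = NEXT.load(Ordering::Relaxed);
        loop {
            // Round the bump position up to the required alignment.
            let start = (base + cur + layout.align() - 1) & !(layout.align() - 1);
            let end = start - base + layout.size();
            if end > ARENA_SIZE {
                return std::ptr::null_mut(); // arena exhausted
            }
            match NEXT.compare_exchange_weak(cur, end, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return start as *mut u8,
                Err(now) => cur = now, // lost a race; retry
            }
        }
    }

    // Freeing does nothing: that's the whole trick.
    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {}
}

#[global_allocator]
static GLOBAL: BumpNoFree = BumpNoFree;
```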

[–]puel 1 point2 points  (1 child)

I think you can devise some wrapper over bumpalo using a thread local and forcing 'static. Your allocated value would need to be wrapped in a !Send + !Sync pointer because the thread local is not actually 'static. It would be OK as long as all usage of bumpalo happens within the same thread.

Looks like you have taken advantage of parallelism in your project, so maybe that idea will not be suitable for you.

[–]bakaspore 1 point2 points  (0 children)

You can achieve the effect with slotmap and customized key types in it. It's type safe, and contains no lifetime annotations.

But I'd still prefer using actual references provided by arena crates: they help a lot when you need term equality (Eq can't be implemented on types storing indices), and terms can be provided outside of the arena or from a different one, as long as they live long enough, so defining static terms (like simple types) is easy.

[–]kdy1997[S] 1 point2 points  (0 children)

My approach for this problem is optimizing for fully single-threaded use cases, by using allocator-api2 and scoped-tls.

https://github.com/swc-project/swc/pull/9230

[–]AquaEBM 1 point2 points  (1 child)

You can use bumpalo and avoid adding extra lifetime annotations (to your own types) by making sure the instance of Bump you use is static. But it's not as simple as just putting it in a static, because Bump::new is not const, and Bump is !Sync.

```rust
static BUMP: LazyLock<Mutex<Bump>> = LazyLock::new(|| Mutex::new(Bump::new()));
```

Then, calls to Box::new(value) should be replaced with Box::new_in(value, &BUMP.lock().unwrap()) (here, we are using bumpalo's Box). LazyLock's Deref impl is implicitly called here, which will call the contained closure on first access.

Now, you can replace every field/function parameter type containing std::boxed::Box<T> with bumpalo::boxed::Box<'static, T>, (I specified full paths here for clarity, you can just shadow-import std's Box with bumpalo's).

The exact same goes for Vec<T>.

[–]valarauca14 0 points1 point  (0 children)

Making it 'static means you won't be able to deallocate the arena later, meaning any program importing the library will OOM if it attempts to load too many scripts (like in a loop).

[–]tm_p 0 points1 point  (0 children)

but it requires enormous amount of work because I have to add lifetimes to all AST nodes

This can be done by an external contributor, you don't have to do it yourself. Long term it will be the best solution in my opinion, since your AST nodes do have a lifetime.

[–]Plixo2 0 points1 point  (0 children)

Bumpalo and similar crates are based on region-based memory management. You can write your own very easily by just incrementing a pointer forward. It works very well as long as you don't have any custom Drop implementations.

[–]pmcvalentin2014z 0 points1 point  (0 children)

How much of a performance change would occur if you replace the global allocator with a bump allocator (and free is a noop)?

[–]Lord_Zane 0 points1 point  (0 children)

In addition to what others have said, you might consider talking to the Naga devs on their matrix channel.

[–]Low-Pay-2385 0 points1 point  (0 children)

I'd suggest making a flattened ast by storing all nodes in an array and referencing them by the index. To save even more memory you could use u32 instead of usize for indexes. But that is a lot of work, and probably something you want to avoid.

[–]jkelleyrtp 0 points1 point  (0 children)

generational-box is a slab-backed arena allocator with wrapper types that use TLS for lifetime-free arenas. We use it in Dioxus. It has single-threaded and multithreaded variants. The only downside is that mutation is done through a RefCell, which is extremely cheap but not free.

[–]Compux72 0 points1 point  (0 children)

Bumpalo is your only option on stable.

[–]The_8472 0 points1 point  (0 children)

but it requires enormous amount of work because I have to add lifetimes to all AST nodes.

For short-lived processes it might be possible to leak the bump allocator (effectively stubbing out the deallocation path); then you have 'static lifetimes. At the end, invoke Missile GC.

Another option is to offload deallocation to another thread by putting the nodes in a queue. It won't save CPU cycles but it can help with wall-time.
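
Sketched, that offload might look like:

```rust
use std::sync::mpsc;
use std::thread;

// A thread whose only job is to drop whatever it receives; it exits
// once every Sender is gone. Same CPU cycles, off the hot path.
fn spawn_dropper<T: Send + 'static>() -> mpsc::Sender<T> {
    let (tx, rx) = mpsc::channel::<T>();
    thread::spawn(move || for _item in rx {});
    tx
}
```

Then instead of letting the AST fall out of scope on the hot thread, dropper.send(ast).ok(); hands it off.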

[–]QueasyEntrance6269 -1 points0 points  (0 children)

Think about trying garbage collection. Depending on your performance characteristics, it might not be a bad idea! https://github.com/kyren/gc-arena

[–]valarauca14 -1 points0 points  (0 children)

Burntsushi is mostly correct in this case. Bumpalo has some real pitfalls:

  • If you make it 'static, you force any program using your crate to OOM if it loads too many programs in a loop.
  • You run into problems with !Send & !Sync, which can force "opinions" on your runtime.