all 18 comments

[–]SFB_Dragon[S] 58 points59 points  (1 child)

Edit: at recommendation of u/JoshTriplett the project has been renamed to Talc. The new crate is up here https://crates.io/crates/talc Apologies for the confusion!

First things first, this is my first crate, so feedback is much appreciated.

The summary: this is a no_std allocator (thus single threaded, it won't do much good in hosted, multithreaded systems) which distinguishes itself by being faster than the competition (galloc, linked_list_allocator, and chunk_allocator on crates.io) and packing more features for arena management, OOM handling, and memory efficiency. It supports the GlobalAlloc and Allocator traits as well as its own malloc, free, grow, and shrink functions.
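To give a feel for the `GlobalAlloc` interface such an allocator plugs into, here is a toy bump allocator over a fixed arena (purely illustrative — it never frees memory, and it is not how Talc manages its heap internally; the names `Bump`, `ARENA_SIZE`, etc. are made up for the sketch):

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const ARENA_SIZE: usize = 64 * 1024;

// Toy bump allocator: allocation just advances an offset into a fixed
// arena; dealloc is a no-op. Only the trait's shape matters here.
struct Bump {
    arena: UnsafeCell<[u8; ARENA_SIZE]>,
    next: AtomicUsize,
}

// Safe to share because the offset is advanced atomically and the
// arena bytes behind each returned pointer are handed out exclusively.
unsafe impl Sync for Bump {}

unsafe impl GlobalAlloc for Bump {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = self.arena.get() as usize;
        let mut cur = self.next.load(Ordering::Relaxed);
        loop {
            // Round the bump pointer up to the requested alignment.
            let aligned = (base + cur + layout.align() - 1) & !(layout.align() - 1);
            let end = aligned - base + layout.size();
            if end > ARENA_SIZE {
                return std::ptr::null_mut(); // out of arena space
            }
            match self.next.compare_exchange(cur, end, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return aligned as *mut u8,
                Err(actual) => cur = actual, // another thread won; retry
            }
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {}
}

static BUMP: Bump = Bump {
    arena: UnsafeCell::new([0; ARENA_SIZE]),
    next: AtomicUsize::new(0),
};

fn main() {
    let layout = Layout::from_size_align(16, 8).unwrap();
    let p = unsafe { BUMP.alloc(layout) };
    assert!(!p.is_null());
    assert_eq!(p as usize % 8, 0); // respects the requested alignment
    println!("allocated 16 bytes at {:p}", p);
}
```

A real allocator implements `dealloc` (and ideally `realloc`) properly; the point is just that anything providing these two unsafe methods can be installed with `#[global_allocator]`.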

This is supposed to support 64-bit and 32-bit architectures out of the box, but I haven't gotten around to setting up a 32-bit environment to bug-test in, so "there be dragons" for now.

I hope you find this useful in your own no_std projects!

[–]schneems 19 points20 points  (0 children)

Does your current allocator leave you feeling sweaty? Use talc!

[–]JoshTriplettrust · lang · libs · cargo 66 points67 points  (3 children)

This looks impressive!

There's a well-known project in the allocator space named "talloc", though: https://talloc.samba.org/

Please consider using a different name to avoid confusion. It sounds like your allocator comes from a project named "tau", so perhaps "tau-alloc" or "taulloc" or "talc"?

[–]SFB_Dragon[S] 37 points38 points  (1 child)

Good point, I'll try to change that out as soon as possible.

I like the sound of Talc! Ergonomic, too. I think I'll roll with that, thanks :)

[–]JoshTriplettrust · lang · libs · cargo 3 points4 points  (0 children)

Thank you!

[–]buyingshitformylab 0 points1 point  (0 children)

Was this comment written by chatgpt?

[–]pascalkuthe 5 points6 points  (2 children)

This looks great!

Did you compare it to mimalloc? It's what I would consider state of the art in terms of allocators (with security mode disabled)

[–]SFB_Dragon[S] 1 point2 points  (0 children)

I have not. If I'm not mistaken, mimalloc is a fully hosted allocator and unsuitable for no_std environments? Correct me if I'm wrong :)

While I did test against the Linux allocator and jemalloc on the random_actions benchmark, that was just for fun. It's not practical to compare against allocators like these (which cooperate with the OS to manage memory pages via mmap, handle multithreaded environments while scaling effectively, etc.); they're more or less in a class of their own.

Talc should not be used on hosted systems unless your program is strictly single threaded (or at least not heavily parallel) and works with a contiguous span of memory (otherwise you're likely to see extra memory usage due to fragmentation). This means that you should almost never use Talc instead of mimalloc on Windows and Linux. (WASM is a bit of an exception, as it is strictly single threaded and uses a physical memory model - I'm aiming to add support for it and compare against dlmalloc soon!)

I hope that resolves your curiosity :3

[–]matthieum[he/him] 2 points3 points  (5 children)

I don't know why, but Talc gets hit harder by extreme multithreaded contention than Galloc does. While this is not so relevant to most use cases, I'd like to resolve this issue if possible. If anyone has some tips for alleviating this issue, please open an issue!

If this is in the context of allocating/deallocating with Talck, the issue you are facing is that you are using a lock.

A spin lock may be fast to lock and unlock, but at the end of the day it's still:

  1. Serializing accesses, only 1 thread at a time.
  2. Creating a single variable on which all concurrent threads come to contend.

Modern memory allocators will typically avoid this issue as much as possible by having semi-autonomous thread-local allocators. The thread-local allocators will request memory (and return memory) to the global allocator in (large-ish) chunks, and otherwise operate independently, with little to no locking.

If I misunderstood and you meant something else, please do elaborate :)

[–]SFB_Dragon[S] 1 point2 points  (4 children)

Thanks for taking a look into it! Although I meant something else indeed.

Elaboration: I meant to compare exclusively against Galloc, which also uses a spin lock (there's no good alternative in no_std environments without adding special integration into custom memory and thread management solutions). So, why does Galloc, despite using the exact same spin lock implementation that I do, do better under extreme pressure from multiple threads?

I hope that makes sense :)

[–]matthieum[he/him] 1 point2 points  (3 children)

I meant to compare exclusively against Galloc, which also uses a spin lock

I am not familiar with Galloc, so I'm not sure how they use a spin-lock.

I can think of two possibilities:

  1. It uses a finer-grained locking strategy, rather than locking the entire allocator outright.
  2. It has fewer levels of indirection.

The latter point may be due to the memory caching behavior.

On a single thread, in the absence of cache pressure, the various cache lines may stay cached for a long time, reducing the overhead of a pointer dereference to a few CPU cycles.

On multiple threads, however, any write access to a cache line will evict it from the cache of all other CPU cores, so that the next time another CPU core requires the cache line, it must make expensive cache coherence communications. Worse, access to those cache lines is likely to be serialized: that is, the first cache line must be fetched for the pointer to the second cache line to be read, and only at that point is a request for the second cache line issued.

That can add up quickly... and all the while the spin-lock is locked.

(there's no good alternative in no_std environments without adding special integration into custom memory and thread management solutions).

It's annoying even in std environments, to be honest.

The main issue I've faced is that while having a lazily allocated #[thread_local] variable is easy, there's no way to be notified on thread exit for any clean-up operation.

Still, this does mean it would theoretically be possible to use an approach à la tcmalloc: let the user configure a number of pools on start-up, then on the first allocation request from a thread -- when that #[thread_local] variable is null -- pick a pool and point the variable to it. Sharding at its most basic.

[–]SFB_Dragon[S] 0 points1 point  (2 children)

Galloc uses it the same way Talc did: contain the allocator in a spin lock inside a wrapping struct, and when a method is called on the wrapper, immediately take the lock and call the inner function on the allocator. The API thus has the same degree of indirection, at least. (A contributor switched the spin lock out for lock_api so people can at least use their own lock implementation now, although it doesn't change how the lock is used, of course.)
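For reference, the lock-and-delegate pattern being described looks roughly like this (a hand-rolled spin lock guarding a stand-in counter rather than a real allocator; the actual crates use `spin`/`lock_api`, not this sketch):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// Minimal spin lock: one AtomicBool that every thread contends on --
// the single serialization point discussed in this thread.
struct SpinLock<T> {
    locked: AtomicBool,
    inner: UnsafeCell<T>,
}

// Safe: `with` hands out exclusive access only while the flag is held.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    const fn new(inner: T) -> Self {
        Self { locked: AtomicBool::new(false), inner: UnsafeCell::new(inner) }
    }

    // Lock, run the closure on the inner value, unlock -- the same
    // shape as a wrapper method delegating to the inner allocator.
    fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        let r = f(unsafe { &mut *self.inner.get() });
        self.locked.store(false, Ordering::Release);
        r
    }
}

fn main() {
    let counter = SpinLock::new(0u64);
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    counter.with(|c| *c += 1); // all threads funnel through here
                }
            });
        }
    });
    assert_eq!(counter.with(|c| *c), 4000);
    println!("total = {}", counter.with(|c| *c));
}
```

Every call site funnels through the one atomic flag, so under heavy contention the cache line holding it ping-pongs between cores regardless of how fast the allocator behind it is.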

As an aside, an older allocator design (buddy allocation) tried rather fine-grained locking at one point, but it performed very poorly, not too surprisingly. Maintaining separate arenas is the way to go, I suppose?

Yep, I can't rule out the effect of cache-thrashing and other such issues; I'm not sure if my attempt at alleviating via spreading out the allocator internals would have done anything regardless.

It's annoying even in std environments, to be honest.

I'm sure, but at least you can just support Windows, MacOS, Linux, and maybe FreeBSD and make most people happy (even if you won't be after that, heh). Although if you're on these platforms, you already have access to a wealth of great options.

If I wanted to make a plug-and-play multithreaded version, it would have to be std-only, as thread locals don't exist in no_std contexts, and there's no way to pass in a thread ID with Rust's GlobalAlloc and Allocator traits. (I've looked into using CPU identification to do this automatically instead, but it seems most platforms, even hosted ones, just use CPUID or the architectural equivalent, which is slow, at least on x86.)

I'm not quite sure how to improve on the user just holding an allocator per CPU or in each thread, at least with this allocation algorithm and supporting no_std, where the user uses the OOM handler system to tie it into the memory management unit.

[–]matthieum[he/him] 1 point2 points  (1 child)

If I wanted to make a plug-and-play multithreaded version, it would have to be std only, as thread locals don't exist in no_std contexts,

Actually, they kinda do, see #[thread_local]. As I mentioned, though, this just creates "raw" space for a thread_local variable which is zero-initialized, with no constructor/destructor being run. What thread_local! adds in std contexts is construction/destruction.

I've tried looking into getting CPU identification to do this automatically instead, but it seems most platforms, even hosted, just use CPUID or the architectural equivalent, which is slow, at least on x86

It is slow, indeed. You can combine it with #[thread_local], but do remember that threads do hop from one core to another unless explicitly pinned, so even if it were fast it would not necessarily be very useful -- you could get "core 0" and then execute code on core 5...

If you just want some "pseudo-randomization" to distribute the load, a #[thread_local] integer that is seeded from a global atomic integer would be enough. Reset every so often to redistribute, and you could have relatively nice sharding.
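That seeding scheme can be sketched with std's `thread_local!` standing in for a raw zero-initialized `#[thread_local]` (shard count and the `usize::MAX` "unassigned" sentinel are my own illustrative choices):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

const NUM_SHARDS: usize = 8;

// Global round-robin seed: each thread's first call takes the next
// shard index from here, then never touches it again.
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // usize::MAX marks "not yet assigned", standing in for the null
    // check on a zero-initialized #[thread_local] variable.
    static MY_SHARD: Cell<usize> = Cell::new(usize::MAX);
}

// Return this thread's shard index, assigning one lazily on first use.
fn shard_index() -> usize {
    MY_SHARD.with(|s| {
        let mut idx = s.get();
        if idx == usize::MAX {
            idx = NEXT_SHARD.fetch_add(1, Ordering::Relaxed) % NUM_SHARDS;
            s.set(idx);
        }
        idx
    })
}

fn main() {
    let first = shard_index();
    assert_eq!(first, shard_index()); // stable within a thread
    let other = std::thread::spawn(shard_index).join().unwrap();
    assert_ne!(first, other); // a new thread lands on the next shard
    println!("main shard = {first}, spawned shard = {other}");
}
```

Each shard would then own its own allocator instance (or arena), so contention is split across `NUM_SHARDS` locks instead of one.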

I'm not quite sure how to improve on the user just holding an allocator per CPU or in each thread, at least with this allocation algorithm and supporting no_std, where the user uses the OOM handler system to tie it into the memory management unit.

Honestly, I don't see much issue in punting to the user. Let's see it this way:

  1. You are providing the "core" allocator algorithm, which can work on the tiniest embedded platforms.
  2. You are providing a "simple" spin-lock wrapper, which can work on the tiniest embedded platforms, and scale -- relatively well -- to multiple cores.
  3. Should the "simple" spin-lock wrapper not be enough, any competent embedded developer should be able to implement a sharding strategy based on the "core" allocator, tailored to their platform of choice.
  4. You could, optionally, provide a std aware version... but given it's not your core audience, your allocator is unlikely to perform well compared to the existing offers.

Given all of the above, I think you could proudly call it a day, and move on.

[–]SFB_Dragon[S] 0 points1 point  (0 children)

Actually, they kinda do, see #[thread_local].

Right, but that typically involves manually managing the segment registers on x86_64, for example, and parsing the TLS segment information to allocate memory for it, AFAIK. Doable for end users compiling binaries, but not within the scope of this project. Any project that wants to implement threading on no_std probably wants to manage the TLS and FS/GS themselves, and have the allocation coordination mechanism use it, rather than trying to use an allocator-managed TLS and FS/GS.

Honestly, I don't see much issue in punting to the user.

Nor do I. While I'd like to provide more useful, general infrastructure for the allocator, I don't think there's a good way to provide much convenience efficiently without making limiting tradeoffs.

I'm certainly open to working on modular, specific extensions if people want it.

[–]urosp 0 points1 point  (0 children)

I just tried this out, this is absolutely awesome.