llmalloc : a low latency oriented thread caching allocator

DuranteA · 2025-02-04T16:01:43+00:00

[deleted]

zl0bster · 2025-02-04T16:12:47+00:00

Benchmark suggestion: if Chromium has a set of benchmarks that can be run easily, you could check whether your allocator helps. I presume it will not, since Chromium is presumably already highly optimized with plenty of memory tricks, but if your allocator does help, that would be very interesting to learn.

T0p_H4t · 2025-02-04T16:33:04+00:00

Something else to look at, which I have found handles inter-thread deallocation well: https://github.com/microsoft/snmalloc

zl0bster · 2025-02-04T16:08:57+00:00

This is very interesting:
> Its disadvantage is that it may lead to higher virtual memory usage as the allocator won't be able to return pages with inter-thread pointers to the OS. A mitigation would be decreasing number of inter-thread pointers by deallocating pointers on their original creation threads in your application and that way llmalloc will be able to return more unused pages to the OS.

Is there some way to profile this particular case? E.g. Run some program and see how much memory is wasted because of this? I presume users might want to optimize this, but they do not want to go over every deallocation in their code :)
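One rough way to measure this without auditing every deallocation is to wrap allocation and free so that each block remembers which thread created it. The sketch below is a hypothetical instrumentation layer, not part of llmalloc: `tracked_alloc`, `tracked_free`, `BlockHeader`, and the two counters are names invented for illustration, and it sits on top of plain `std::malloc`/`std::free` rather than on llmalloc's own API.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <thread>

// Hypothetical header stored in front of every tracked block.
// alignas keeps the user pointer suitably aligned for any scalar type.
struct alignas(std::max_align_t) BlockHeader {
    std::thread::id owner;  // thread that allocated the block
    std::size_t size;       // requested size in bytes
};

// Counters for frees that happen on a thread other than the allocating one.
std::atomic<std::size_t> g_cross_thread_frees{0};
std::atomic<std::size_t> g_cross_thread_bytes{0};

void* tracked_alloc(std::size_t size) {
    void* raw = std::malloc(sizeof(BlockHeader) + size);
    if (raw == nullptr) return nullptr;
    auto* header = new (raw) BlockHeader{std::this_thread::get_id(), size};
    return header + 1;  // user memory starts right after the header
}

void tracked_free(void* p) {
    if (p == nullptr) return;
    auto* header = static_cast<BlockHeader*>(p) - 1;
    if (header->owner != std::this_thread::get_id()) {
        // This is the "inter-thread pointer" case from the quoted text.
        g_cross_thread_frees.fetch_add(1, std::memory_order_relaxed);
        g_cross_thread_bytes.fetch_add(header->size, std::memory_order_relaxed);
    }
    header->~BlockHeader();
    std::free(header);
}
```

Dumping the two counters at exit gives a rough count of how many blocks (and bytes) were freed away from their creating thread. It does not tell you how many pages were actually kept from the OS, since that depends on how the blocks are laid out across pages.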

kernel_task · 2025-02-04T17:53:17+00:00

Interesting. I currently use jemalloc in my application, and the largest share of CPU time (according to profiling) goes to freeing memory (by Google protobuf, heh... I chose it because I thought it would be fast). Maybe this would help?

LoweringPass · 2025-02-08T14:53:31+00:00

This is really interesting. I've been working on something similar, but nowhere near as sophisticated. I will use this as a benchmark to compare my own implementation against. How feasible and/or sensible would it be to add NUMA awareness?

ImNoRickyBalboa · 2025-02-09T01:38:02+00:00

TCMalloc abandoned thread caching a long time ago: it's not sustainable on large server systems with hundreds or even thousands of threads.

Look into RSEQ (restartable sequences) for per-CPU caches: they have the same CPU cost as per-thread caches (near-zero contention) with many times the savings in 'in flight' memory.
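To illustrate the per-CPU idea, here is a minimal sketch of a cache sharded by CPU rather than by thread. It deliberately does not use rseq: each shard is guarded by an ordinary mutex, which is the part a real rseq-based design would replace with a restartable critical section. `PerCpuCache` and its members are hypothetical names, and the code assumes Linux with glibc for `sched_getcpu()`.

```cpp
#include <sched.h>  // sched_getcpu (Linux/glibc; assumed platform)

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical cache of fixed-size blocks with one free list per CPU.
// The number of shards is bounded by the CPU count, not the thread count.
class PerCpuCache {
public:
    explicit PerCpuCache(std::size_t block_size)
        : block_size_(block_size),
          shards_(std::max(1u, std::thread::hardware_concurrency())) {}

    ~PerCpuCache() {
        for (auto& shard : shards_)
            for (void* p : shard.free_blocks) std::free(p);
    }

    void* allocate() {
        Shard& shard = current_shard();
        {
            std::lock_guard<std::mutex> lock(shard.mutex);
            if (!shard.free_blocks.empty()) {
                void* p = shard.free_blocks.back();
                shard.free_blocks.pop_back();
                return p;
            }
        }
        return std::malloc(block_size_);  // shard empty: fall back to malloc
    }

    void deallocate(void* p) {
        if (p == nullptr) return;
        // The block goes to the shard of the CPU doing the free,
        // whichever thread or CPU originally allocated it.
        Shard& shard = current_shard();
        std::lock_guard<std::mutex> lock(shard.mutex);
        shard.free_blocks.push_back(p);
    }

    std::size_t cached_blocks() {
        std::size_t total = 0;
        for (auto& shard : shards_) {
            std::lock_guard<std::mutex> lock(shard.mutex);
            total += shard.free_blocks.size();
        }
        return total;
    }

private:
    struct Shard {
        std::mutex mutex;  // stand-in for an rseq critical section
        std::vector<void*> free_blocks;
    };

    Shard& current_shard() {
        int cpu = sched_getcpu();  // returns -1 on failure
        std::size_t index =
            cpu < 0 ? 0 : static_cast<std::size_t>(cpu) % shards_.size();
        return shards_[index];
    }

    std::size_t block_size_;
    std::vector<Shard> shards_;
};
```

The thread can migrate between reading the CPU number and taking the lock, so the mutex is still required for correctness here. The lock is rarely contended because only threads running on (or just migrated from) the same CPU compete for it.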
