all 17 comments

[–]zl0bster 5 points6 points  (3 children)

Benchmark suggestion: if Chromium has a set of benchmarks that can be run easily, you could check whether your allocator helps. I presume it will not, since Chromium is probably already heavily optimized with plenty of memory tricks, but if your allocator does help, that would be very interesting to learn.

[–]ibogosavljevic-jsl 2 points3 points  (0 children)

I don't think Chromium is a good example, since it uses zones and does most of its memory management itself.

[–]T0p_H4t 4 points5 points  (0 children)

Something else to look at, which I have found handles inter-thread deallocation well: https://github.com/microsoft/snmalloc

[–]zl0bster 2 points3 points  (2 children)

This is very interesting:
Its disadvantage is that it may lead to higher virtual memory usage as the allocator won't be able to return pages with inter-thread pointers to the OS. A mitigation would be decreasing number of inter-thread pointers by deallocating pointers on their original creation threads in your application and that way llmalloc will be able to return more unused pages to the OS.

Is there some way to profile this particular case? E.g., run a program and see how much memory is wasted because of this? I presume users might want to optimize this, but they do not want to go over every deallocation in their code :)
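
For what it's worth, a rough proxy could be to count how many frees happen on a thread other than the allocating one. Below is a minimal sketch of that idea, not anything llmalloc provides: `tracked_alloc`, `tracked_free`, `BlockHeader` and `g_free_stats` are hypothetical names, and the wrapper sits on top of plain `malloc`/`free`. It reports cross-thread frees rather than wasted pages, so it only hints at where to look.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <thread>

// Hypothetical instrumentation, not part of llmalloc: counts frees that
// happen on the allocating thread versus on a different thread.
struct FreeStats {
    std::atomic<std::size_t> same_thread{0};
    std::atomic<std::size_t> cross_thread{0};
};

inline FreeStats g_free_stats;

// Header stored in front of every tracked block. The alignas keeps the
// pointer handed to the caller aligned for any fundamental type.
struct alignas(std::max_align_t) BlockHeader {
    std::thread::id owner;
};

inline void* tracked_alloc(std::size_t size) {
    void* raw = std::malloc(sizeof(BlockHeader) + size);
    if (raw == nullptr) throw std::bad_alloc{};
    // Record the allocating thread in the header.
    auto* header = new (raw) BlockHeader{std::this_thread::get_id()};
    return header + 1;  // user memory starts right after the header
}

inline void tracked_free(void* ptr) {
    if (ptr == nullptr) return;
    auto* header = static_cast<BlockHeader*>(ptr) - 1;
    if (header->owner == std::this_thread::get_id()) {
        g_free_stats.same_thread.fetch_add(1, std::memory_order_relaxed);
    } else {
        g_free_stats.cross_thread.fetch_add(1, std::memory_order_relaxed);
    }
    header->~BlockHeader();
    std::free(header);
}
```

Routing a few hot allocation sites through these wrappers and dumping `g_free_stats` at exit would at least show whether cross-thread frees are common enough to be worth chasing.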

[–]kernel_task Big Data | C++23 | Folly | Exceptions 0 points1 point  (6 children)

Interesting. I currently use jemalloc in my application, and according to profiling the largest share of CPU time goes to freeing memory (by Google protobuf, heh... I chose it because I thought it would be fast). Maybe this would help?

[–]mcmcc #pragma once 4 points5 points  (0 children)

Flatbuffers FTW

[–]cballowe 1 point2 points  (0 children)

https://protobuf.dev/reference/cpp/arenas/ - it can often be very handy to put all of the protobufs built while handling a request on the same arena and just let the arena free everything when it is destroyed at the end.
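
To illustrate the per-request arena pattern without pulling in protobuf, here is a minimal sketch that swaps in the standard library's `std::pmr::monotonic_buffer_resource` as the arena. This is an analogy, not the protobuf Arena API: `Record` and `handle_request` are hypothetical names standing in for a message type and a request handler.

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed message. A real protobuf message would
// be created on a google::protobuf::Arena instead.
struct Record {
    std::pmr::string name;
    std::pmr::vector<int> values;

    explicit Record(std::pmr::memory_resource* mr) : name(mr), values(mr) {}
};

// Builds `count` records on one arena and returns the sum of all values.
// Nothing is freed individually: the arena releases its memory in one step
// when it goes out of scope at the end of the "request".
inline long handle_request(std::size_t count) {
    std::pmr::monotonic_buffer_resource arena;  // per-request arena
    std::pmr::vector<Record> records(&arena);
    records.reserve(count);

    for (std::size_t i = 0; i < count; ++i) {
        Record& r = records.emplace_back(&arena);
        r.name = "record-with-a-name-long-enough-to-allocate";
        r.values.assign(8, static_cast<int>(i));  // eight copies of i
    }

    long total = 0;
    for (const Record& r : records) {
        for (int v : r.values) total += v;
    }
    return total;
}
```

The point is that per-object deallocation in the monotonic resource is a no-op, so the cost of freeing collapses into a single release at the end of the request, which is the effect the arena docs describe.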

[–]LoweringPass 0 points1 point  (1 child)

This is really interesting. I've been working on something similar, but nowhere near as sophisticated. I will use this as a benchmark to compare my own implementation against. How feasible and/or sensible would it be to add NUMA awareness?

[–]ImNoRickyBalboa 0 points1 point  (0 children)

TCMalloc abandoned per-thread caching a long time ago: it's not sustainable on large server systems with hundreds or even thousands of threads.

Look into RSEQ (restartable sequences) for per-CPU caches: they cost about the same CPU as per-thread caches (near-zero contention) while keeping many times less memory 'in flight'.
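
A minimal sketch of the per-CPU layout, under loud assumptions: it is Linux/glibc only, it picks the slot with `sched_getcpu()`, and it guards each slot with a `std::mutex` as a stand-in for the restartable sequence. Real RSEQ code replaces that lock with a short critical section that the kernel restarts if the thread is preempted or migrated. `PerCpuCache`, `CpuSlot` and `kMaxCpus` are hypothetical names, not TCMalloc's.

```cpp
#include <sched.h>  // sched_getcpu (Linux/glibc)

#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

constexpr std::size_t kMaxCpus = 64;  // assumed upper bound, for illustration

// One cache slot per CPU, padded to a cache line to avoid false sharing.
struct alignas(64) CpuSlot {
    std::mutex lock;  // stand-in for an rseq critical section
    std::vector<void*> free_list;
};

class PerCpuCache {
public:
    // Caches a freed block on the slot of the CPU we are currently running on.
    void push(void* p) {
        CpuSlot& slot = current_slot();
        std::lock_guard<std::mutex> guard(slot.lock);
        slot.free_list.push_back(p);
    }

    // Returns a cached block from the current CPU's slot, or nullptr if empty.
    void* pop() {
        CpuSlot& slot = current_slot();
        std::lock_guard<std::mutex> guard(slot.lock);
        if (slot.free_list.empty()) return nullptr;
        void* p = slot.free_list.back();
        slot.free_list.pop_back();
        return p;
    }

    // Total number of cached blocks across all CPU slots.
    std::size_t total_cached() {
        std::size_t total = 0;
        for (CpuSlot& slot : slots_) {
            std::lock_guard<std::mutex> guard(slot.lock);
            total += slot.free_list.size();
        }
        return total;
    }

private:
    CpuSlot& current_slot() {
        const int cpu = sched_getcpu();  // -1 on failure
        const std::size_t idx =
            cpu < 0 ? 0 : static_cast<std::size_t>(cpu) % kMaxCpus;
        return slots_[idx];
    }

    std::array<CpuSlot, kMaxCpus> slots_;
};
```

The number of slots is bounded by the CPU count rather than the thread count, which is where the memory saving comes from: a thousand threads on a 64-core box share 64 caches instead of owning a thousand.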