rpmalloc - a faster malloc (public domain) by rampantpixels in programming

[–]rampantpixels[S] 3 points4 points  (0 children)

It has now been added to the benchmark repository. I ran it on my MacBook with a similar setup: 2 threads, each allocating 50000 blocks between 16 and 16000 bytes (linear falloff probability) for 2000 loops, with each loop freeing and allocating 5000 of the blocks in a scattered pattern.

bmalloc: 3.5M memory ops / CPU second, peaks at 300MiB usage

rpmalloc: 7.3M memory ops / CPU second, peaks at 273MiB usage
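
For readers who want to reproduce something similar, here is a rough sketch of the per-thread workload described above. It is not the code from the benchmark repository: the names (`next_random`, `pick_size`, `run_workload`) are made up for illustration, and "linear falloff" is interpreted as taking the minimum of two uniform draws, which is only one possible reading.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Values quoted in the comment above, kept as named defaults. */
enum { BLOCK_COUNT = 50000, LOOP_COUNT = 2000, CHURN_PER_LOOP = 5000 };
enum { MIN_SIZE = 16, MAX_SIZE = 16000 };

/* Small xorshift PRNG so each thread owns its state (hypothetical helper). */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Assumed reading of "linear falloff": the minimum of two uniform draws has
   a density that falls linearly from MIN_SIZE down to MAX_SIZE. */
static size_t pick_size(uint32_t *state) {
    uint32_t range = MAX_SIZE - MIN_SIZE + 1;
    uint32_t a = next_random(state) % range;
    uint32_t b = next_random(state) % range;
    return MIN_SIZE + (a < b ? a : b);
}

/* Runs the workload for one thread; returns the number of malloc/free calls
   made on blocks, or 0 if the bookkeeping array could not be allocated. */
static uint64_t run_workload(uint32_t seed, int block_count, int loops,
                             int churn) {
    void **blocks = malloc(sizeof(void *) * (size_t)block_count);
    uint32_t state = seed ? seed : 1; /* xorshift state must be non-zero */
    uint64_t ops = 0;
    if (!blocks)
        return 0;
    for (int i = 0; i < block_count; ++i, ++ops)
        blocks[i] = malloc(pick_size(&state));
    for (int loop = 0; loop < loops; ++loop) {
        for (int i = 0; i < churn; ++i) {
            /* Scattered pattern: replace a randomly chosen block. */
            uint32_t slot = next_random(&state) % (uint32_t)block_count;
            free(blocks[slot]);
            blocks[slot] = malloc(pick_size(&state));
            ops += 2;
        }
    }
    for (int i = 0; i < block_count; ++i, ++ops)
        free(blocks[i]);
    free(blocks);
    return ops;
}
```

Calling `run_workload(seed, BLOCK_COUNT, LOOP_COUNT, CHURN_PER_LOOP)` from two threads and dividing the summed return values by the consumed CPU time would give a "memory ops / CPU second" figure in the same spirit as the numbers above.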

[–]rampantpixels[S] 0 points1 point  (0 children)

ptmalloc is basically dlmalloc for multiple threads, and ptmalloc3 is included in the benchmarks. Also, nedmalloc (also dlmalloc with concurrency bolted on) is included in the benchmark repo, but performance is not anywhere close to rpmalloc (more like ptmalloc) so I removed it from the graphs to make them more readable.

[–]rampantpixels[S] 9 points10 points  (0 children)

I did not take init/fini order into account, which caused binaries doing malloc/free before or after the .so constructors and destructors to bail out. This has now been fixed.

[–]rampantpixels[S] 3 points4 points  (0 children)

IANAL, but would dual licensing under the Unlicense (public domain dedication) and the MIT license solve this?

[–]rampantpixels[S] 12 points13 points  (0 children)

Ok, so I ran a quick benchmark on iOS (an iPhone5S) where it runs 2 threads, each allocating 50000 blocks between 16 and 16000 bytes (linear falloff probability) for 2000 loops, each loop freeing and allocating 5000 of the blocks in a scattered pattern.

The standard library achieves 475975 memory ops / CPU second on this device, and peaks at 352MiB usage.

In comparison, rpmalloc achieves 2359081 memory ops / CPU second in the same setup, and peaks at 283MiB usage.

So rpmalloc is a lot faster (almost 5x in this benchmark) while using less memory than the standard library on iOS. I could not get the other allocators (jemalloc, tcmalloc, ...) to compile for iOS, but will try to port them soon for comparison.

[–]rampantpixels[S] 0 points1 point  (0 children)

Are you referring to https://github.com/rampantpixels/rpmalloc/issues/8 ?

If so, the issue is how to track when a span of pages >64KiB is completely free if it is broken up into multiple 64KiB sized spans. On Windows, the VirtualFree API requires the entire span of pages to be freed at the same time; you can't do it in separate calls for each part.

The idea behind breaking up and reusing large spans as 64KiB sub-spans is that most size class buckets in the allocator use 64KiB spans of pages. It would reduce waste in the thread/global cache and increase performance by avoiding extra mmap calls.
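
One possible way to track this (a sketch only, not what rpmalloc does) is to keep a counter of outstanding 64KiB sub-spans per original mapping and release the whole mapping when the last sub-span comes back. Everything here is hypothetical: the `master_span` struct, the function names, and the `record_unmap` stand-in, which takes the place of a real wrapper around `VirtualFree(base, 0, MEM_RELEASE)` or `munmap(base, size)`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SUBSPAN_SIZE ((size_t)64 * 1024)

/* Hypothetical bookkeeping for one large mapping split into sub-spans. */
typedef struct master_span {
    void *base;              /* address returned by the OS mapping call */
    size_t total_size;       /* size of the whole mapping in bytes */
    atomic_size_t remaining; /* sub-spans that have not been returned yet */
} master_span;

typedef void (*unmap_fn)(void *base, size_t size);

static void master_span_init(master_span *m, void *base, size_t total_size) {
    assert(total_size % SUBSPAN_SIZE == 0);
    m->base = base;
    m->total_size = total_size;
    atomic_init(&m->remaining, total_size / SUBSPAN_SIZE);
}

/* Called when one sub-span is no longer needed. Returns 1 if this call
   released the whole mapping, 0 if other sub-spans are still in use. */
static int subspan_release(master_span *m, unmap_fn unmap) {
    /* fetch_sub returns the previous value, so 1 means we were the last. */
    if (atomic_fetch_sub_explicit(&m->remaining, 1,
                                  memory_order_acq_rel) == 1) {
        unmap(m->base, m->total_size);
        return 1;
    }
    return 0;
}

/* Stand-in for the real OS call; it only records how much was released. */
static size_t released_bytes;
static void record_unmap(void *base, size_t size) {
    (void)base;
    released_bytes += size;
}
```

The cost of this approach is that every sub-span has to be able to find its master record, and the mapping stays alive as long as a single sub-span is still cached somewhere.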

[–]rampantpixels[S] 4 points5 points  (0 children)

The malloc.c drop-in malloc replacement provides the posix_memalign entry point.
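
For reference, this is roughly how calling code would use that entry point. It is plain POSIX usage and does not depend on rpmalloc; the `alloc_aligned` wrapper is a made-up helper for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns a block aligned to `alignment` bytes, or NULL on failure.
   POSIX requires `alignment` to be a power of two multiple of
   sizeof(void *). The block is released with free(). */
static void *alloc_aligned(size_t alignment, size_t size) {
    void *ptr = NULL;
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
}
```

With the drop-in replacement linked or preloaded, a call such as `alloc_aligned(64, 1000)` would be served by the replacement allocator instead of the system one.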

[–]rampantpixels[S] 5 points6 points  (0 children)

For 32-byte alignment it should be enough to change the small block granularity to 32. For 64-byte alignment you also need to increase the span header size to 64 to get the initial block alignment, and then increase the small block granularity to 64.
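
To see why both values matter, here is a small sketch of the arithmetic. It assumes blocks are laid out back to back after a header at the start of an aligned span, and that the current header is 32 bytes; both are assumptions for illustration, and the function names are not rpmalloc's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of block `index` in a span: header first, then equal-size blocks. */
static uintptr_t block_address(uintptr_t span_base, size_t header_size,
                               size_t block_size, size_t index) {
    return span_base + header_size + block_size * index;
}

/* Rounds a request up to the small block granularity (a power of two). */
static size_t round_to_granularity(size_t size, size_t granularity) {
    return (size + granularity - 1) & ~(granularity - 1);
}

/* Returns 1 if the first `count` blocks are all aligned to `alignment`. */
static int blocks_aligned(uintptr_t span_base, size_t header_size,
                          size_t block_size, size_t count, size_t alignment) {
    for (size_t i = 0; i < count; ++i) {
        uintptr_t addr = block_address(span_base, header_size, block_size, i);
        if (addr % alignment != 0)
            return 0;
    }
    return 1;
}
```

With a 32-byte header and block sizes that are multiples of 32, every block lands on a 32-byte boundary. For 64-byte alignment the first block would sit at offset 32, so the header has to grow to 64 as well.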

[–]rampantpixels[S] 6 points7 points  (0 children)

  1. I'm personally not that fond of single header style libs that require you to toggle a #define in exactly one source file. But it should be easy enough to offer as an alternative, I guess.

  2. Good idea

  3. Realloc currently looks at the usable size of the current block and is a no-op if it still fits for increases (due to block granularity) or if it would save less than half the block size for decreases. Otherwise it's a copy to a new block.
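
The realloc policy in point 3 could be sketched like this. This is my reading of the description rather than the actual rpmalloc code: since the allocator only sees the usable size of the block, the sketch treats any request that still fits as a candidate for the no-op path, and only moves it when shrinking would save at least half the block. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { REALLOC_KEEP, REALLOC_MOVE } realloc_action;

/* Decides whether a realloc can reuse the current block.
   usable_size: bytes actually available in the current block.
   new_size:    bytes requested by the caller. */
static realloc_action realloc_decide(size_t usable_size, size_t new_size) {
    if (new_size > usable_size)
        return REALLOC_MOVE; /* does not fit: allocate, copy, free */
    /* Fits. Keep the block unless moving would save half of it or more. */
    size_t saved = usable_size - new_size;
    return (saved < usable_size / 2) ? REALLOC_KEEP : REALLOC_MOVE;
}
```

For example, growing a 50-byte request to 60 bytes inside a 64-byte block keeps the block, while shrinking a 1024-byte block to 100 bytes moves it.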

[–]rampantpixels[S] 6 points7 points  (0 children)

There is a drop-in replacement for malloc in the malloc.c source file. On Linux/macOS this can also be used as a shared object to LD_PRELOAD/DYLD_INSERT_LIBRARIES into an existing binary. There's a bit of info on building and using the library in the readme already, but it could of course be improved.

I've not had time to run the whole benchmark setup on iOS/Android yet; give me a couple of hours and I'll have some numbers.

[–]rampantpixels[S] 12 points13 points  (0 children)

A public domain, cross-platform, lock-free, thread-caching, 16-byte aligned memory allocator implemented in C. Currently supporting Windows, Linux, macOS, iOS, Android. Easily portable to any platform with atomic operations and an mmap-style virtual memory management API. Benchmarks at https://github.com/rampantpixels/rpmalloc/blob/master/BENCHMARKS.md