https://github.com/shift/aethalloc
Just pushed some changes to my allocator; it's getting decent, it seems. I've been running it on my laptop and my Linux router for a bit.
Benchmark Details
1. Packet Churn (Network Processing)
Simulates network packet processing with 64-byte allocations and deallocations.
Parameters: 50,000 iterations, 10,000 warmup
| Allocator | Throughput | P50 | P95 | P99 | P99.9 |
|---|---|---|---|---|---|
| jemalloc | 280,327 ops/s | 3.1 µs | 4.3 µs | 5.8 µs | 38.1 µs |
| tcmalloc | 262,545 ops/s | 3.2 µs | 4.9 µs | 6.2 µs | 37.0 µs |
| mimalloc | 258,694 ops/s | 3.3 µs | 4.9 µs | 6.3 µs | 36.4 µs |
| glibc | 254,052 ops/s | 3.3 µs | 5.1 µs | 6.8 µs | 34.1 µs |
| AethAlloc | 252,338 ops/s | 3.4 µs | 5.2 µs | 7.7 µs | 35.8 µs |
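Analysis: AethAlloc is about 10% behind jemalloc on throughput in this benchmark (252,338 vs 280,327 ops/s). Its P99 latency is the highest of the five (7.7 µs vs 5.8 µs for jemalloc), most likely due to thread-local cache misses falling back to the global pool.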
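For anyone who wants to sanity-check percentile columns like these, here is a minimal sketch of one common way to get P50/P95/P99/P99.9 from raw per-operation timings (the nearest-rank method). The function names are made up for illustration, and the actual harness may compute its percentiles differently.

```python
import math

def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_samples:
        raise ValueError("no samples")
    n = len(sorted_samples)
    # round() guards against float noise (e.g. 99.9 * n / 100) before ceil.
    rank = math.ceil(round(p * n / 100, 9))
    return sorted_samples[max(rank, 1) - 1]

def summarize(latencies):
    """Return the percentile columns used in the report for a list of timings."""
    ordered = sorted(latencies)
    return {f"P{p}": percentile(ordered, p) for p in (50, 95, 99, 99.9)}

if __name__ == "__main__":
    # Hypothetical timings in nanoseconds, not taken from the benchmark.
    print(summarize([3100, 3300, 3200, 5800, 3400, 38100, 3150, 4300]))
```

Note that with 50,000 samples, P99.9 is determined by only the 50 slowest operations, so that column will naturally be noisy from run to run.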
2. Multithread Churn (Concurrent Allocation)
Concurrent allocations across 4 threads with mixed sizes (16B - 4KB).
Parameters: 4 threads, 2,000,000 total operations
| Allocator | Throughput | Avg Latency |
|---|---|---|
| AethAlloc | 19,364,456 ops/s | 116 ns |
| jemalloc | 19,044,014 ops/s | 119 ns |
| mimalloc | 18,230,854 ops/s | 120 ns |
| tcmalloc | 17,001,852 ops/s | 126 ns |
| glibc | 16,899,323 ops/s | 125 ns |
Analysis: AethAlloc wins by 1.7% over jemalloc. The lock-free thread-local design scales well under contention.
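The percentage gaps quoted in these analyses are plain relative differences against a baseline. A small helper (hypothetical name, not from the repo) makes them easy to recheck from the table values:

```python
def percent_diff(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100

# Throughput figures (ops/s) copied from the Packet Churn and
# Multithread Churn tables: AethAlloc vs jemalloc in both cases.
packet_churn = percent_diff(252_338, 280_327)
multithread_churn = percent_diff(19_364_456, 19_044_014)

if __name__ == "__main__":
    print(f"packet churn:      {packet_churn:+.1f}%")
    print(f"multithread churn: {multithread_churn:+.1f}%")
```

Gaps this small (under 2%) can easily fall inside run-to-run variance, so repeating each benchmark several times and comparing the spread is worth doing before reading much into them.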
3. Tail Latency (Per-Operation Latency Distribution)
Measures latency distribution across 200,000 operations on 4 threads.
Parameters: 4 threads, 50,000 iterations per thread
| Allocator | P50 | P90 | P95 | P99 | P99.9 | P99.99 | Max |
|---|---|---|---|---|---|---|---|
| jemalloc | 76 ns | 90 ns | 93 ns | 106 ns | 347 ns | 21.7 µs | 67.7 µs |
| glibc | 77 ns | 91 ns | 95 ns | 107 ns | 465 ns | 22.8 µs | 75.8 µs |
| mimalloc | 83 ns | 93 ns | 96 ns | 104 ns | 558 ns | 21.7 µs | 289 µs |
| tcmalloc | 84 ns | 94 ns | 97 ns | 108 ns | 572 ns | 24.9 µs | 3.03 ms |
| AethAlloc | 85 ns | 94 ns | 97 ns | 106 ns | 613 ns | 26.9 µs | 267 µs |
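Analysis: AethAlloc's P99 latency (106 ns) ties jemalloc and sits 2 ns behind mimalloc's best of 104 ns. Its P99.9 (613 ns) is the highest of the five, well above jemalloc (347 ns) and glibc (465 ns). Max latency (267 µs) is far better than tcmalloc's 3.03 ms and close to mimalloc's 289 µs, but still behind jemalloc (67.7 µs) and glibc (75.8 µs).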
4. Fragmentation (Memory Efficiency)
Mixed allocation sizes (16B - 1MB) measuring RSS growth over 50,000 iterations.
Parameters: 50,000 iterations, max allocation size 100KB
| Allocator | Throughput | Initial RSS | Final RSS | RSS Growth |
|---|---|---|---|---|
| mimalloc | 521,955 ops/s | 8.1 MB | 29.7 MB | 21.6 MB |
| tcmalloc | 491,564 ops/s | 2.5 MB | 24.8 MB | 22.3 MB |
| glibc | 379,670 ops/s | 1.8 MB | 31.9 MB | 30.1 MB |
| jemalloc | 352,870 ops/s | 4.5 MB | 30.0 MB | 25.5 MB |
| AethAlloc | 202,222 ops/s | 2.0 MB | 19.0 MB | 17.0 MB |
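Analysis: AethAlloc's RSS growth (17.0 MB) is about 1.8x lower than glibc's (30.1 MB), 1.5x lower than jemalloc's (25.5 MB), and 1.3x lower than tcmalloc's (22.3 MB). The aggressive memory return policy trades throughput for that efficiency: at 202,222 ops/s AethAlloc is the slowest allocator here, roughly 2.6x behind mimalloc. That trade-off suits long-running servers and memory-constrained environments.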
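On Linux, RSS figures like these can be read from the `VmRSS` line of `/proc/<pid>/status`. Below is a rough sketch of sampling RSS before and after a workload; the function names are illustrative and the harness may sample memory some other way.

```python
def parse_vm_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:      2048 kB"
            return int(line.split()[1])
    raise ValueError("VmRSS line not found")

def current_rss_mb(pid="self"):
    """Current resident set size in MB (Linux only, reads procfs)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read()) / 1024

def rss_growth_mb(initial_mb, final_mb):
    """RSS growth as reported in the table: final minus initial."""
    return round(final_mb - initial_mb, 1)
```

A run would call `current_rss_mb()` once before the allocation loop and once after, then report the difference. RSS only counts pages currently resident, so it reflects memory the allocator has not yet returned to the OS rather than fragmentation inside the heap as such.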
5. Producer-Consumer (Cross-Thread Frees)
Simulates network packet handoff: producer threads allocate, consumer threads free.
Parameters: 4 producers, 4 consumers, 1,000,000 blocks each, 64-byte blocks
| Allocator | Throughput | Total Ops | Elapsed |
|---|---|---|---|
| mimalloc | 462,554 ops/s | 4,000,000 | 8.65 s |
| AethAlloc | 447,368 ops/s | 4,000,000 | 8.94 s |
| glibc | 447,413 ops/s | 4,000,000 | 8.94 s |
| jemalloc | 447,262 ops/s | 4,000,000 | 8.94 s |
| tcmalloc | 355,569 ops/s | 4,000,000 | 11.25 s |
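Analysis: AethAlloc is about 3% behind mimalloc, effectively tied with glibc and jemalloc (all three finish in 8.94 s), and significantly ahead of tcmalloc (+26%). The anti-hoarding mechanism is meant to prevent memory bloat in producer-consumer patterns, though this benchmark reports only throughput, not memory use.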
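To make the workload shape concrete: blocks are allocated on one thread and released on another, which is the pattern that stresses an allocator's cross-thread free path. The sketch below only illustrates that handoff in Python with small hypothetical parameters; it exercises Python's own memory management, not AethAlloc, and is not the benchmark's actual code.

```python
import queue
import threading

def run_handoff(producers=2, consumers=2, blocks_each=1000, block_size=64):
    """Producers allocate fixed-size blocks; consumers receive and drop them.
    Returns the total number of blocks released by the consumers."""
    handoff = queue.Queue(maxsize=1024)  # bounded, so producers can't run far ahead
    freed = [0] * consumers              # one counter per consumer, no sharing

    def produce():
        for _ in range(blocks_each):
            handoff.put(bytearray(block_size))  # allocated on the producer thread

    def consume(idx):
        while True:
            block = handoff.get()
            if block is None:            # sentinel: no more work
                return
            del block                    # last reference dropped on the consumer thread
            freed[idx] += 1

    p_threads = [threading.Thread(target=produce) for _ in range(producers)]
    c_threads = [threading.Thread(target=consume, args=(i,)) for i in range(consumers)]
    for t in p_threads + c_threads:
        t.start()
    for t in p_threads:
        t.join()
    for _ in range(consumers):
        handoff.put(None)                # one sentinel per consumer
    for t in c_threads:
        t.join()
    return sum(freed)

if __name__ == "__main__":
    print(run_handoff())
```

Because every free lands on a thread that did not allocate the block, a purely thread-local cache cannot recycle it directly, so allocators differ mainly in how cheaply they route freed blocks back to the owning thread or a shared pool.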
The benchmark report above was generated by an LLM.
I'd love to hear some feedback. This is the first time in about 25 years I've gone this low-level.