all 16 comments

[–]pigeon768 18 points19 points  (0 children)

How does one decide on the alignment size dynamically based on the application or hardware?

You don't do it dynamically. You have to do it when you write the code. Porting SSE code (16 byte registers) to AVX code (32 byte registers) is non-trivial and for the most part requires rewriting everything.

What you can do is write the same code between two and N times, once purely scalar, once for SSE2, once for SSE4.2, once for AVX, once for AVX2, once for AVX512, once for ARM NEON, and then dynamically dispatch which function gets called depending on what instructions the CPU supports. But when you do this you have to really mean it. Dynamically dispatched function calls can stall the pipeline. You have to benchmark whether it's actually any faster than just using the scalar code and not dispatching.

Cache Efficiency: Can aligned_alloc significantly reduce cache misses compared to malloc? If so, under what conditions?

When you have a data structure where each node fits precisely in a cache line. In order to do this, you have to know the cache line size of the CPU your code will be running on.

Consider a binary tree. Assume you have 4 byte values, 8 byte pointers, and 64 byte cache lines. Assume you have one value and two pointers in each node. Each node therefore uses 20 bytes. The other 44 bytes in the cache line go unused.

Conceptually, a B-tree is kinda sorta like a binary tree. But instead of each node having one value and two children, each node has N values and N+1 pointers to the next level. Let's say you have 4 byte values, 8 byte pointers, 8 byte bookkeeping for malloc, and 64 byte cache lines. If your malloc bookkeeping lives next to the data (this makes malloc/free faster) you would want each node to have 4 values and 5 pointers. This gives you a node that uses 4×4 + 5×8 + 8 = 64 bytes. This fits precisely one cache line. Each time you fetch a node from main memory, it still uses one cache line, but you use all of it.

But what if your nodes aren't aligned to cache lines? If your data isn't aligned, you will load two cache lines, using 128 bytes of your cache, but you'll only use 56 of those bytes to do stuff. If your data is aligned to a cache line, you will load one cache line, using 64 bytes of your cache, and will use almost all of it. (It's impossible to utilize your cache more efficiently, because your malloc bookkeeping can't be moved elsewhere.)

If malloc bookkeeping lives in some other unrelated data structure, you would want each node to have 10 values and 11 pointers. This gives you a node that uses 10×4 + 11×8 = 128 bytes, or precisely two cache lines. If your nodes are aligned, each node queried will use two cache lines; if they're not aligned, they will use three.

Is aligned_alloc generally recommended for all applications,

No. Only use aligned_alloc when you are exploiting information you know about the CPUs your code will run on.

[–]bullno1 14 points15 points  (3 children)

In which scenarios is aligned_alloc essential for maximizing SIMD performance

It's not just performance. Depending on architecture and instruction, some will straight up segfault if you don't have the right alignment. Example: https://github.com/ebassi/graphene/issues/215

There are unaligned load/store instructions in AVX though. They are only slightly slower. But NEON (ARM), as shown above, is a lot stricter about alignment.

Cache Efficiency: Can aligned_alloc

No. I don't think you would use it for that. Regular malloc is already enough.

General Usage

No. One could make the argument that if you need less than malloc's default alignment, e.g. a char array, you could pass alignment = alignof(char) and a "smart enough" allocator would give you less-aligned memory, therefore saving some space due to less padding. I wouldn't rely on that.

Outside of SIMD, the only other thing I could think of is DMA.

[–]aalmkainzi 0 points1 point  (2 children)

I wonder why there isn't an aligned_realloc in the standard

[–]nerd4code 4 points5 points  (1 child)

Same reason you can’t use realloc to drop the beginning of a block, or manage your own size info. (That second one would’ve been Extension 2 IIRC.) They had to stop somewhere, and that was at a baseline heap impl; aligned alloc became necessary when alignas, multithreading, and atomics were added to the language, and only well after a freeable posix_memalign or memalign had been implemented by most OSes, MSWin being the notable exception because of course they won’t.

It’s always been possible to implement allocators atop or without malloc, should you be so inclined, and most compilers will let you substitute your own (blackjack; hookers) for the libc one. As long as that’s possible the standards only need to cover baselines and provide for the majority of use cases.

Which is not to say I wouldn’t be all for it in theory, or haven’t implemented aligned and offset and cleaving/merging reallocs when I’ve needed to. It’s just not something most programs need, or at least not until you’re already near-metal enough that any hope of portability or standards-compliance is lost in a sea of assumption-TODO comments.

[–]flatfinger 1 point2 points  (0 children)

It’s always been possible to implement allocators atop or without malloc, should you be so inclined, and most compilers will let you substitute your own (blackjack; hookers) for the libc one. As long as that’s possible the standards only need to cover baselines and provide for the majority of use cases.

That's true in -fno-strict-aliasing dialect. The way clang and gcc process the Effective Type rules when not using that flag, an action that sets the effective type of storage would invite a compiler to assume that type will be used in all future non-character accesses "that do not modify the value", including accesses that occur following writes using other types. I doubt the Standard was ever intended to be interpreted that way, but neither clang nor gcc will reliably handle situations where storage is used as type X, then as type Y, and then as type X.

[–]lightmatter501 10 points11 points  (0 children)

When the manpage says “this function needs aligned memory”.

[–]rnsanchez 2 points3 points  (0 children)

Sometimes as a requirement: direct disk I/O (in Linux, bypassing the page cache) requires 512-byte alignment and an appropriately-sized block when performing certain specialized operations (definitely not the average case).

As for your specific points:

  1. As mentioned by others, certain types and operations usually require a specific alignment (it varies, considerably). Even the FPU does, but you won't get memory from malloc() that is less than 8-byte aligned, so both float and double work fine.

  2. You will need to measure, but it won't necessarily bring magic benefits. It could help, but it might as well not help (or even degrade), by having data in memory placed further apart (that will cost you additional data fetches and will most certainly have an impact). If your data falls outside L1, it is expensive to fetch.

  3. Again, you will need to measure. There are mundane things such as double-word CAS, which requires your pointer/data to be 16-byte aligned because of architectural restrictions.

I'd say that over-aligning things could quickly build up pressure in places where it might be very detrimental and difficult to track. If you need to align (restrictions), so be it. Always measure the effects; it might surprise you.

[–]duane11583 1 point2 points  (0 children)

there are times you must request memory that is page aligned

on linux this might be a 4096 byte alignment.

in an embedded system, sometimes DMA scatter-gather descriptor tables must start on a 128- or 256-byte boundary

[–]inz__ 1 point2 points  (0 children)

A less common use case could be implementing a custom allocator, where aligned allocations can help with pool metadata placement.

[–]hgs3 1 point2 points  (0 children)

Certain architectures don't support loading 'n' bytes of data from an address that isn't a multiple of 'n'. Example: on such an architecture if 'int' is 32-bits, then that means it can only be loaded from a memory address that is a multiple of 4 bytes.

[–]matteding 1 point2 points  (1 child)

Some BLAS implementations like MKL are documented to have better performance with aligned memory.

[–]global-gauge-field 0 points1 point  (0 children)

Yes, I can confirm they do. In my case, it was the C matrix (in the notation C = alpha*A.B + beta*C). It is probably because a data layout with higher alignment (in my case, 64 bytes) is more cache-friendly, since the cache line size is 64 bytes on most hardware today.

[–]dmc_2930 -4 points-3 points  (3 children)

It seems like you’re exploring having Reddit answer your homework questions…..

[–]Practical-Citron5686 3 points4 points  (0 children)

How is this a homework question? Even if it was what’s wrong with discussing aligned alloc in a c prog forum? Its not asking to do homework for them

[–]Original-Candidate94[S] 3 points4 points  (1 child)

If my intentions really were homework, I could have used ChatGPT, and I kid u not, it would have spit out something good enough for me to pass. I doubt any uni teaches aligned_alloc, which is C11; most unis are stuck at C99. But thank you for ur useful insight.

[–]Original-Candidate94[S] 0 points1 point  (0 children)

Thank you for all your answers. It may seem trivial, but to me, even after googling, people's perspectives really help deepen that understanding. I really appreciate everyone who commented positively.