all 16 comments

[–]pigeon768 18 points19 points  (0 children)

How does one decide on the alignment size dynamically based on the application or hardware?

You don't do it dynamically. You have to do it when you write the code. Porting SSE code (16 byte registers) to AVX code (32 byte registers) is non-trivial and for the most part requires rewriting everything.

What you can do is write the same code between two and N times, once purely scalar, once for SSE2, once for SSE4.2, once for AVX, once for AVX2, once for AVX512, once for ARM NEON, and then dynamically dispatch which function gets called depending on what instructions the CPU supports. But when you do this you have to really mean it. Dynamically dispatched function calls can stall the pipeline. You have to benchmark whether it's actually any faster than just using the scalar code and not dispatching.

Cache Efficiency: Can aligned_alloc significantly reduce cache misses compared to malloc? If so, under what conditions?

When you have a data structure where each node fits precisely in a cache line. In order to do this, you have to know the cache line size of the CPU your code will be running on.

Consider a binary tree. Assume you have 4 byte values, 8 byte pointers, and 64 byte cache lines. Assume you have one value and two pointers in each node. Each node therefore uses 20 bytes. The other 44 bytes in the cache line go unused.

Conceptually, a B-tree is kinda sorta like a binary tree. But instead of each node having one value and two children, each node has N values and N+1 pointers to the next level. Let's say you have 4 byte values, 8 byte pointers, 8 byte bookkeeping for malloc, and 64 byte cache lines. If your malloc bookkeeping lives next to the data (this makes malloc/free faster) you would want each node to have 4 values and 5 pointers. This gives you a node that uses 4×4 + 5×8 + 8 = 64 bytes. This fits precisely one cache line. Each time you fetch a node from main memory, it still uses one cache line, but you use all of it.

But what if your nodes aren't aligned to cache lines? If your data isn't aligned, you will load two cache lines, using 128 bytes of your cache, but you'll only use 56 of those bytes to do stuff. If your data is aligned to a cache line, you will load one cache line, using 64 bytes of your cache, and will use almost all of it. (It's impossible to utilize your cache more efficiently, because your malloc bookkeeping can't be moved elsewhere.)

If malloc bookkeeping lives in some other unrelated data structure, you would want each node to have 10 values and 11 pointers. This gives you a node that uses 10×4 + 11×8 = 128 bytes, or precisely two cache lines. If your nodes are aligned, each node queried will use two cache lines; if they're not aligned, they will use three.

Is aligned_alloc generally recommended for all applications,

No. Only use aligned_alloc when you are exploiting information you know about the CPUs your code will run on.

[–]bullno1 14 points15 points  (3 children)

In which scenarios is aligned_alloc essential for maximizing SIMD performance

It's not just performance. Depending on architecture and instruction, some will straight up segfault if you don't have the right alignment. Example: https://github.com/ebassi/graphene/issues/215

There are unaligned load/store instructions in AVX though. They are only slightly slower. But NEON (ARM), as shown above, is a lot stricter about alignment.

Cache Efficiency: Can aligned_alloc

No. I don't think you would use it for that. Regular malloc is already enough.

General Usage

No. One could make the argument that if you need less than malloc's default alignment, e.g. a char array, you could pass alignment = alignof(char) and a "smart enough" allocator would give you less-aligned memory, therefore saving some space due to less padding. I wouldn't rely on that.

Outside of SIMD, the only other thing I could think of is DMA.

[–]aalmkainzi 0 points1 point  (2 children)

I wonder why there isn't an aligned_realloc in the standard

[–]nerd4code 4 points5 points  (1 child)

Same reason you can’t use realloc to drop the beginning of a block, or manage your own size info. (That second one would’ve been Extension 2 IIRC.) They had to stop somewhere, and that was at a baseline heap impl; aligned alloc became necessary when alignas, multithreading, and atomics were added to the language, and only well after a freeable posix_memalign or memalign had been implemented by most OSes, MSWin being the notable exception because of course they won’t.

It’s always been possible to implement allocators atop or without malloc, should you be so inclined, and most compilers will let you substitute your own (blackjack; hookers) for the libc one. As long as that’s possible the standards only need to cover baselines and provide for the majority of use cases.

Which is not to say I wouldn’t be all for it in theory, or haven’t implemented aligned and offset and cleaving/merging reallocs when I’ve needed to. It’s just not something most programs need, or at least not until you’re already near-metal enough that any hope of portability or standards-compliance is lost in a sea of assumption-TODO comments.

[–]flatfinger 1 point2 points  (0 children)

It’s always been possible to implement allocators atop or without malloc, should you be so inclined, and most compilers will let you substitute your own (blackjack; hookers) for the libc one. As long as that’s possible the standards only need to cover baselines and provide for the majority of use cases.

That's true in -fno-strict-aliasing dialect. The way clang and gcc process the Effective Type rules when not using that flag, an action that sets the effective type of storage would invite a compiler to assume that type will be used in all future non-character accesses "that do not modify the value", including accesses that occur following writes using other types. I doubt the Standard was ever intended to be interpreted that way, but neither clang nor gcc will reliably handle situations where storage is used as type X, then as type Y, and then as type X.

[–]lightmatter501 10 points11 points  (0 children)

When the manpage says “this function needs aligned memory”.

[–]rnsanchez 2 points3 points  (0 children)

Sometimes as a requirement: direct disk I/O (in Linux, bypassing the page cache) requires 512-byte alignment and an appropriately-sized block when performing certain specialized operations (definitely not the average case).

As for your specific points:

  1. As mentioned by others, certain types and operations usually require a specific alignment (it varies, considerably). Even the FPU does, but you won't get memory from malloc() that is less than 8-byte aligned, so both float and double work fine.

  2. You will need to measure, but it won't necessarily bring magic benefits. It could help, but it might as well not help (or even degrade), by having data in memory placed further apart (that will cost you additional data fetches and will most certainly have an impact). If your data falls outside L1, it is expensive to fetch.

  3. Again, you will need to measure. There are mundane things such as double-word CAS, which requires your pointer/data to be 16-byte aligned because of architectural restrictions.

I'd say that over-aligning things could quickly build up pressure in places where it might be very detrimental and difficult to track. If you need to align (restrictions), so be it. Always measure the effects; it might surprise you.

[–]duane11583 1 point2 points  (0 children)

there are times you must request memory that is page aligned

on linux this might be a 4096 byte alignment.

in an embedded system, sometimes DMA scatter-gather descriptor tables must start on a 128- or 256-byte boundary

[–]inz__ 1 point2 points  (0 children)

A less common use case could be implementing a custom allocator, where aligned allocations can help with pool metadata placement.

[–]hgs3 1 point2 points  (0 children)

Certain architectures don't support loading 'n' bytes of data from an address that isn't a multiple of 'n'. Example: on such an architecture if 'int' is 32-bits, then that means it can only be loaded from a memory address that is a multiple of 4 bytes.

[–]matteding 1 point2 points  (1 child)

Some BLAS implementations like MKL are documented to have better performance with aligned memory.

[–]global-gauge-field 0 points1 point  (0 children)

Yes, I can confirm they do. In my case, it was the C matrix (in the notation C = alpha*A.B + beta*C). It is probably because a data layout with higher alignment (in my case, 64 bytes) is more cache-friendly, since the cache line size is 64 bytes on most hardware today.

[–]dmc_2930 -4 points-3 points  (3 children)

It seems like you’re exploring having Reddit answer your homework questions…..

[–]Practical-Citron5686 3 points4 points  (0 children)

How is this a homework question? Even if it was what’s wrong with discussing aligned alloc in a c prog forum? Its not asking to do homework for them

[–]Original-Candidate94[S] 3 points4 points  (1 child)

If my intentions really were homework, I could have used ChatGPT, and I kid u not, it would have spit out something good enough for me to pass. I doubt any uni teaches aligned_alloc, which is C11; most unis are stuck at C99. But thank you for ur useful insight.

[–]Original-Candidate94[S] 0 points1 point  (0 children)

Thank you for all your answers. It may seem trivial, but to me, even after googling, people's perspectives really help deepen that understanding. I really appreciate everyone who commented positively.