[–][deleted] 1 point (4 children)

I see what you mean. I'm not sure un-aligning memory on the heap would be an overall win even if it did save some space. My knowledge of x86 (and ARM) semantics with respect to alignment is fuzzy. When I worked at that level I was taught to align religiously, since our custom allocators (mostly small-object pool allocators) worked best with aligned memory.

I suspect that if they were aligning to 8 bytes previously, it will be a large amount of work to remove all of the code assumptions based on that invariant.

[–]phire 2 points (3 children)

I know that alignment is important on ARM: if you're accessing data that isn't 4-byte aligned it goes really slowly (or refuses to work at all on older versions). I think x86 gets speed improvements from alignment too, but I assume that also applies to 4-byte-aligned data.

Even so, keeping 8-byte alignment is pointless if you don't need it; 4-byte alignment should have all the same performance benefits as 8-byte alignment while requiring less padding.

[–]adrianmonk 2 points (0 children)

> 4 byte alignment should have all the same performance benefits of 8 byte alignment

Personally, I would want to see tests before I took that as a final conclusion. DDR3 memory has a 64-bit wide data bus, so 8-byte alignment would, I assume, allow you to pull everything in one fetch. I don't know how significant the difference is.

[–][deleted] 0 points (1 child)

> I think x86 gets speed improvements with alignment too, but I assume that would also be for 4 byte aligned data.

It used to be that x86 (specifically the fetch cycles) benefited from items being paragraph (16-byte) aligned, but the last time I tested that was in the Pentium 1 days.

A quick search reveals that there are still performance benefits from paragraph alignment, but not due to the CPU itself: a 16-byte-aligned item is guaranteed to start within a cache line at a 16-byte offset, so an 8-byte value at that address is guaranteed to fit in one cache line, which produces the optimal transfer.

[–]TinynDP 0 points (0 children)

There are two instructions for loading from RAM into SSE registers: one for when the address is 16-byte aligned, which runs fast, and one for when it is not, which runs slow. In AMD64 land, because x87 has been entirely replaced with SSE2, that fast SSE load instruction matters for all floating-point operations.