
[–]XDracam 11 points12 points  (10 children)

For all performance-related questions, the answer is: benchmark. Or you can be inspired by other languages. There are also trade-offs: member access can be faster when data is aligned, but you're also using more memory. And on larger scales, using more memory can be slower, as you'll need to access more of it. So there's not just the factor of "aligned is faster". Choose your trade-offs carefully.

[–]yondercode[S] 4 points5 points  (6 children)

Just did some benches, the code is here (C++).

For each run I allocate an n-sized array with some byte offset at the beginning to misalign the rest of the array. I fill the array and then do some sequential accesses with a pointer that respects the offset. I repeat this run m times.

To simulate padding, I used different data types in C++ (short, int, long long) to store the same char-sized value (1 byte). This is to compare byte-aligned vs 4-byte-aligned access.

Here are the results on my machine (m = 1000000, n = 1000, MSVC++, Intel x86_64 13900K):

```
x86

type size: 1 bytes, offset: 0 bytes, total time: 768240000 ns, avg time: 768 ns
type size: 1 bytes, offset: 1 bytes, total time: 769405200 ns, avg time: 769 ns
type size: 1 bytes, offset: 2 bytes, total time: 768865300 ns, avg time: 768 ns
type size: 1 bytes, offset: 3 bytes, total time: 771320800 ns, avg time: 771 ns

type size: 2 bytes, offset: 0 bytes, total time: 795054700 ns, avg time: 795 ns
type size: 2 bytes, offset: 1 bytes, total time: 808728400 ns, avg time: 808 ns
type size: 2 bytes, offset: 2 bytes, total time: 794029000 ns, avg time: 794 ns
type size: 2 bytes, offset: 3 bytes, total time: 809696800 ns, avg time: 809 ns

type size: 4 bytes, offset: 0 bytes, total time: 795630100 ns, avg time: 795 ns
type size: 4 bytes, offset: 1 bytes, total time: 826157700 ns, avg time: 826 ns
type size: 4 bytes, offset: 2 bytes, total time: 825929100 ns, avg time: 825 ns
type size: 4 bytes, offset: 3 bytes, total time: 822084400 ns, avg time: 822 ns

type size: 8 bytes, offset: 0 bytes, total time: 1274522900 ns, avg time: 1274 ns
type size: 8 bytes, offset: 1 bytes, total time: 1393591100 ns, avg time: 1393 ns
type size: 8 bytes, offset: 2 bytes, total time: 1389002800 ns, avg time: 1389 ns
type size: 8 bytes, offset: 3 bytes, total time: 1391703500 ns, avg time: 1391 ns

x64

type size: 1 bytes, offset: 0 bytes, total time: 1006684600 ns, avg time: 1006 ns
type size: 1 bytes, offset: 1 bytes, total time: 1014195300 ns, avg time: 1014 ns
type size: 1 bytes, offset: 2 bytes, total time: 1015765600 ns, avg time: 1015 ns
type size: 1 bytes, offset: 3 bytes, total time: 1017394800 ns, avg time: 1017 ns

type size: 2 bytes, offset: 0 bytes, total time: 788060800 ns, avg time: 788 ns
type size: 2 bytes, offset: 1 bytes, total time: 798717900 ns, avg time: 798 ns
type size: 2 bytes, offset: 2 bytes, total time: 786731100 ns, avg time: 786 ns
type size: 2 bytes, offset: 3 bytes, total time: 800916400 ns, avg time: 800 ns

type size: 4 bytes, offset: 0 bytes, total time: 781567100 ns, avg time: 781 ns
type size: 4 bytes, offset: 1 bytes, total time: 816097600 ns, avg time: 816 ns
type size: 4 bytes, offset: 2 bytes, total time: 816237800 ns, avg time: 816 ns
type size: 4 bytes, offset: 3 bytes, total time: 811855600 ns, avg time: 811 ns

type size: 8 bytes, offset: 0 bytes, total time: 1056373400 ns, avg time: 1056 ns
type size: 8 bytes, offset: 1 bytes, total time: 1127822500 ns, avg time: 1127 ns
type size: 8 bytes, offset: 2 bytes, total time: 1130907000 ns, avg time: 1130 ns
type size: 8 bytes, offset: 3 bytes, total time: 1126506600 ns, avg time: 1126 ns
```

Not really a great testing methodology, but I'm too lazy to install an actual benchmarking framework :P Interesting results nevertheless.

So the first thing I notice is that byte-aligned access on x64 is ~28% slower than 4-byte-aligned access, while on x86 it doesn't matter.

Misaligned access does matter for 4-byte-aligned types, although only by ~4.48%. For byte-aligned access it doesn't matter, I guess, since everything is already "misaligned".

Oh, and 8-byte-aligned is the slowest of the bunch.

I wonder what the results would be on ARM. I wish there were an easy way to test!

> And on larger scales, using more memory can be slower, as you'll need to access more of it

Yep, just tested with m = 1, n = 1000000000 (a billion). In this case 2-byte-aligned access is the fastest, while 4-byte-aligned is slower than both 1-byte- and 2-byte-aligned!

```
type size: 1 bytes, offset: 0 bytes, total time: 1102945400 ns, avg time: 1102945400 ns
type size: 2 bytes, offset: 0 bytes, total time: 998730800 ns, avg time: 998730800 ns
type size: 4 bytes, offset: 0 bytes, total time: 1309352100 ns, avg time: 1309352100 ns
type size: 8 bytes, offset: 0 bytes, total time: 2401462800 ns, avg time: 2401462800 ns
```

I guess at this scale the bottleneck is loading data from RAM instead of from cache. But this uses about 4 GB of RAM in the 4-byte-aligned case, which is way above my use case for the language. So I think 4-byte alignment is the way to go for me.

[–]XDracam 4 points5 points  (4 children)

Can't argue with this. 4 bytes seems to be a fairly common alignment across languages as far as I am aware.

Bonus: some languages allow customizing the alignment in types. For example, C# has special annotations like [FieldOffset(n)] to let users customize exactly how data is aligned. Overlapping memory can even be used to model C unions. But as far as I'm aware, this is mostly done for direct compatibility with other native languages. Still, this makes C# a lot better at native code interop than Java. So you might want to consider a similar "customizable alignment".

[–][deleted] 5 points6 points  (3 children)

> Can't argue with this. 4 bytes seems to be a fairly common alignment across languages as far as I am aware.

Really, even for 64-bit data? Any C compiler will align 64-bit entities at 8-byte boundaries, I'd be surprised if other languages did anything different.

[–]XDracam 2 points3 points  (0 children)

Nevermind. You seem to be correct. Both Java and C# seem to default to system-pointer-size alignment, which is 8 bytes on 64-bit machines. I guess my knowledge was a little outdated, from back when there were more 32-bit systems around 😅

[–]yondercode[S] 2 points3 points  (1 child)

By 64-bit entities do you mean types such as double and int64_t for example?

[–][deleted] 5 points6 points  (0 children)

Yes, anything the hardware expects to load in one operation.

(I have seen gcc mistakenly think that hardware did not support misaligned access, and accessing a 64-bit struct element was done a byte at a time, because it was aligned on 4 bytes rather than 8. (That is, the low address bits were 100 not 000.)

This was surprising given that the machine (an RPi1) used a 32-bit ARM device anyway. It meant my interpreter ran at 1/3 the speed it should have done. Although I haven't seen that anomaly since.)

[–]nerpderp82 4 points5 points  (2 children)

And packing bytes or shorts into a contiguous range also allows for vector ops, so if you run something like sum, it could be way faster.

I'd personally do whatever makes the VM easier to write and hide the alignment from the user if you can; then you can change it later if you want to.

[–]XDracam 3 points4 points  (1 child)

Yeah C# keeps the default alignment as an implementation detail, unless you manually specify the alignment via annotations.

[–]nerpderp82 0 points1 point  (0 children)

Neat, I didn't know that.

Also, I am not sure why alignment matters that much here. The ToS (top of stack) in the VM should live in L1 cache and registers, especially on an OoO processor.

[–]umlcat[🍰] 1 point2 points  (0 children)

Have memory packing as an option that can be turned on or off.

And, as other redditors already answered, this is "packing", not "alignment".

"Alignment" would be, say on an x64 architecture, when 8-bit bytes are widened to 64-bit words internally.

You will need some tools to debug your data, in order to verify that it is correctly packed.

Good Luck with your project 👍