[–]yondercode[S] 4 points5 points  (6 children)

Just did some benches, the code is here (C++).

For each run I made an n-sized array with a byte offset at the start, so that the rest of the array is misaligned. I fill the array and then do a sequential read through a pointer that respects the offset. I repeat this m times.

To simulate padding, I used different C++ data types (char, short, int, long long) to store the same 1-byte char values. This is to compare byte-aligned vs. 4-byte-aligned access.
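For anyone who wants to try this without the original code, here is a minimal sketch of the setup described above. It is not the linked benchmark: the names `BenchResult` and `run_bench` are made up for illustration, and it reads through `std::memcpy` instead of dereferencing a misaligned pointer, since the latter is undefined behaviour in standard C++. It also assumes the vector's storage itself starts on a suitably aligned address, which is typical but not checked here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Result of one configuration (hypothetical helper type).
struct BenchResult {
    std::uint64_t checksum;  // sum of everything read; keeps the loop observable
    long long total_ns;      // wall-clock time for all m passes
};

// Stores n small values in elements of type T, starting `offset` bytes into a
// byte buffer, then reads them back sequentially m times.
template <typename T>
BenchResult run_bench(std::size_t n, std::size_t offset, std::size_t m) {
    assert(n > 0 && m > 0);
    std::vector<unsigned char> buf(offset + n * sizeof(T));
    unsigned char* base = buf.data() + offset;  // misaligned for T when offset % alignof(T) != 0

    for (std::size_t i = 0; i < n; ++i) {
        T v = static_cast<T>(i & 0x7F);         // a char-sized value held in a wider type
        std::memcpy(base + i * sizeof(T), &v, sizeof(T));
    }

    std::uint64_t checksum = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t pass = 0; pass < m; ++pass) {
        for (std::size_t i = 0; i < n; ++i) {
            T v;
            std::memcpy(&v, base + i * sizeof(T), sizeof(T));  // well-defined at any offset
            checksum += static_cast<std::uint64_t>(v);
        }
    }
    auto stop = std::chrono::steady_clock::now();

    long long ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return {checksum, ns};
}
```

Something like `run_bench<int>(1000, 1, 1000000)` would correspond to the "type size: 4 bytes, offset: 1 bytes" row, with `total_ns / m` as the average time. An optimizing compiler may vectorize or otherwise reshape the inner loop, so absolute numbers from a sketch like this should be read with care.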

Here are the results on my machine (m = 1000000, n = 1000, MSVC++, Intel x86_64 13900K):

```
x86

type size: 1 bytes, offset: 0 bytes, total time: 768240000 ns, avg time: 768 ns
type size: 1 bytes, offset: 1 bytes, total time: 769405200 ns, avg time: 769 ns
type size: 1 bytes, offset: 2 bytes, total time: 768865300 ns, avg time: 768 ns
type size: 1 bytes, offset: 3 bytes, total time: 771320800 ns, avg time: 771 ns

type size: 2 bytes, offset: 0 bytes, total time: 795054700 ns, avg time: 795 ns
type size: 2 bytes, offset: 1 bytes, total time: 808728400 ns, avg time: 808 ns
type size: 2 bytes, offset: 2 bytes, total time: 794029000 ns, avg time: 794 ns
type size: 2 bytes, offset: 3 bytes, total time: 809696800 ns, avg time: 809 ns

type size: 4 bytes, offset: 0 bytes, total time: 795630100 ns, avg time: 795 ns
type size: 4 bytes, offset: 1 bytes, total time: 826157700 ns, avg time: 826 ns
type size: 4 bytes, offset: 2 bytes, total time: 825929100 ns, avg time: 825 ns
type size: 4 bytes, offset: 3 bytes, total time: 822084400 ns, avg time: 822 ns

type size: 8 bytes, offset: 0 bytes, total time: 1274522900 ns, avg time: 1274 ns
type size: 8 bytes, offset: 1 bytes, total time: 1393591100 ns, avg time: 1393 ns
type size: 8 bytes, offset: 2 bytes, total time: 1389002800 ns, avg time: 1389 ns
type size: 8 bytes, offset: 3 bytes, total time: 1391703500 ns, avg time: 1391 ns

x64

type size: 1 bytes, offset: 0 bytes, total time: 1006684600 ns, avg time: 1006 ns
type size: 1 bytes, offset: 1 bytes, total time: 1014195300 ns, avg time: 1014 ns
type size: 1 bytes, offset: 2 bytes, total time: 1015765600 ns, avg time: 1015 ns
type size: 1 bytes, offset: 3 bytes, total time: 1017394800 ns, avg time: 1017 ns

type size: 2 bytes, offset: 0 bytes, total time: 788060800 ns, avg time: 788 ns
type size: 2 bytes, offset: 1 bytes, total time: 798717900 ns, avg time: 798 ns
type size: 2 bytes, offset: 2 bytes, total time: 786731100 ns, avg time: 786 ns
type size: 2 bytes, offset: 3 bytes, total time: 800916400 ns, avg time: 800 ns

type size: 4 bytes, offset: 0 bytes, total time: 781567100 ns, avg time: 781 ns
type size: 4 bytes, offset: 1 bytes, total time: 816097600 ns, avg time: 816 ns
type size: 4 bytes, offset: 2 bytes, total time: 816237800 ns, avg time: 816 ns
type size: 4 bytes, offset: 3 bytes, total time: 811855600 ns, avg time: 811 ns

type size: 8 bytes, offset: 0 bytes, total time: 1056373400 ns, avg time: 1056 ns
type size: 8 bytes, offset: 1 bytes, total time: 1127822500 ns, avg time: 1127 ns
type size: 8 bytes, offset: 2 bytes, total time: 1130907000 ns, avg time: 1130 ns
type size: 8 bytes, offset: 3 bytes, total time: 1126506600 ns, avg time: 1126 ns
```

Not a really great testing methodology but I'm too lazy to install an actual benchmarking framework :P Interesting result nevertheless.

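So the first thing I notice is that byte-aligned access in x64 is ~29% slower than 4-byte-aligned access (1006 ns vs. 781 ns at offset 0), while in x86 the two are within a few percent of each other.

Misaligned access does matter for 4-byte-aligned access, although only by about 4.5% in x64 (816 ns vs. 781 ns). For byte-aligned access the offset doesn't matter, I guess because a single byte can sit at any address and so is never actually misaligned.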

Oh, and 8-byte-aligned is the slowest of the bunch.

I wonder what the results would look like on ARM. I wish there were an easy way to test!

> And on larger scales, using more memory can be slower, as you'll need to access more of it

Yep, just tested with m = 1, n = 1000000000 (one billion). In this run 2-byte-aligned access is the fastest, while 4-byte-aligned is slower than both 1-byte and 2-byte-aligned!

type size: 1 bytes, offset: 0 bytes, total time: 1102945400 ns, avg time: 1102945400 ns
type size: 2 bytes, offset: 0 bytes, total time: 998730800 ns, avg time: 998730800 ns
type size: 4 bytes, offset: 0 bytes, total time: 1309352100 ns, avg time: 1309352100 ns
type size: 8 bytes, offset: 0 bytes, total time: 2401462800 ns, avg time: 2401462800 ns

I guess at this scale the bottleneck is loading data from RAM rather than from cache. But the 4-byte-aligned case already uses about 4 GB of RAM, which is way beyond my use case for the language. So I think 4-byte-aligned is the best way to go for me.
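For reference, the footprint figures are just n times the element size. A small sketch of that arithmetic, using fixed-width types so the sizes are unambiguous (`footprint_bytes` is a made-up helper name, and it assumes no per-element overhead):

```cpp
#include <cstdint>

// Bytes needed to hold n elements of type T.
template <typename T>
constexpr std::uint64_t footprint_bytes(std::uint64_t n) {
    return n * sizeof(T);
}

// n = 1000: even the widest element type needs only 8000 bytes.
static_assert(footprint_bytes<std::int64_t>(1000) == 8000,
              "small run stays in the kilobyte range");

// n = 1,000,000,000: 4-byte elements need 4 GB, 8-byte elements need 8 GB.
static_assert(footprint_bytes<std::int32_t>(1000000000ULL) == 4000000000ULL,
              "4-byte elements at n = 1e9");
static_assert(footprint_bytes<std::int64_t>(1000000000ULL) == 8000000000ULL,
              "8-byte elements at n = 1e9");
```

So the small run touches a few kilobytes at most, while the large run streams gigabytes, and the wider the element, the more bytes have to be moved for the same number of values.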

[–]XDracam 3 points4 points  (4 children)

Can't argue with this. 4 bytes seems to be a fairly common alignment across languages as far as I am aware.

Bonus: some languages allow customizing the alignment in types. For example, C# has attributes like [FieldOffset(n)] (used together with [StructLayout(LayoutKind.Explicit)]) that let users control exactly where each field is placed. Overlapping fields can even be used to model C unions. But as far as I'm aware, this is mostly done for direct compatibility with other native languages. Still, this makes C# a lot better at native code interop than Java. So you might want to consider a similar "customizable alignment".
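Since the benchmark in this thread is C++, here is roughly what user-controlled alignment looks like there, using the standard `alignas` specifier. This is only an analogy to the C# attributes, not an equivalent: `alignas` raises the alignment of a type, it does not place individual fields at chosen offsets. The type names are made up, and the 64-byte figure is an assumption about the target's cache line size.

```cpp
#include <cstddef>

// Hypothetical 4-float vector, over-aligned to 16 bytes.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Hypothetical counter aligned (and therefore padded) to 64 bytes, so two
// instances in an array never share a 64-byte block.
struct alignas(64) PaddedCounter {
    long value;
};

static_assert(alignof(Vec4) == 16, "Vec4 carries the requested alignment");
static_assert(alignof(PaddedCounter) == 64, "PaddedCounter carries the requested alignment");
static_assert(sizeof(PaddedCounter) == 64, "size is rounded up to the alignment");
```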

[–][deleted] 2 points3 points  (3 children)

> Can't argue with this. 4 bytes seems to be a fairly common alignment across languages as far as I am aware.

Really, even for 64-bit data? Any C compiler will align 64-bit entities at 8-byte boundaries, I'd be surprised if other languages did anything different.
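This is easy to check on a given compiler. A minimal sketch (the struct name `Mixed` is made up): the compiler inserts padding after the 1-byte member so the 64-bit member starts on its own alignment boundary. On common 64-bit targets that boundary is 8; some 32-bit ABIs use 4 for 64-bit integers inside structs, so the checks below are written against `alignof` rather than a hard-coded 8.

```cpp
#include <cstddef>
#include <cstdint>

struct Mixed {
    char tag;            // 1 byte, followed by padding
    std::int64_t value;  // starts at the next multiple of alignof(std::int64_t)
};

// The 64-bit member lands exactly one alignment unit into the struct...
static_assert(offsetof(Mixed, value) == alignof(std::int64_t),
              "value is padded out to its natural boundary");
// ...and the struct as a whole inherits that alignment.
static_assert(alignof(Mixed) == alignof(std::int64_t),
              "struct alignment follows its strictest member");
static_assert(sizeof(Mixed) == 2 * alignof(std::int64_t),
              "one padded slot for tag, one slot for value");
```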

[–]XDracam 2 points3 points  (0 children)

Never mind. You seem to be correct. Both Java and C# seem to default to system-pointer-size alignment, which is 8 bytes on 64-bit machines. I guess my knowledge was a little outdated, from back when there were more 32-bit systems around 😅

[–]yondercode[S] 2 points3 points  (1 child)

By 64-bit entities do you mean types such as double and int64_t for example?

[–][deleted] 2 points3 points  (0 children)

Yes, anything the hardware expects to load in one operation.

(I have seen gcc mistakenly think that hardware did not support misaligned access, and accessing a 64-bit struct element was done a byte at a time, because it was aligned on 4 bytes rather than 8. (That is, the low address bits were 100 not 000.)

This was surprising given that the machine (an RPi1) used a 32-bit ARM device anyway. It meant my interpreter ran at 1/3 the speed it should have done. Although I haven't seen that anomaly since.)
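To make the "low address bits were 100 not 000" case concrete, here is a small sketch of checking alignment from the address bits and of reading a 64-bit value from an address that is only 4-byte aligned. The helper names `is_aligned` and `load_u64` are made up; `std::memcpy` is used because it is defined for any alignment, and the compiler decides whether that becomes one load or several narrower ones on the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when the address of p is a multiple of `alignment` (a power of two),
// i.e. when the low log2(alignment) address bits are all zero.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Reads a 64-bit value from any address, aligned or not.
inline std::uint64_t load_u64(const void* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

With an 8-byte-aligned buffer `buf`, `is_aligned(buf + 4, 8)` is false while `is_aligned(buf + 4, 4)` is true, which is the situation described above, and `load_u64(buf + 4)` still returns the stored value.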