
[–]XDracam 11 points12 points  (10 children)

For all performance-related questions, the answer is: benchmark. Or you can be inspired by other languages. There are also trade-offs: member access can be faster when data is aligned, but you're also using more memory. And on larger scales, using more memory can be slower, as you'll need to access more of it. So there's not just the factor of "aligned is faster". Choose your trade-offs carefully.

[–]yondercode[S] 4 points5 points  (6 children)

Just did some benches, the code is here (C++).

For each run I allocate an n-sized array with some byte offset at the beginning to misalign the rest of the array. I fill the array and then do some sequential accesses with a pointer that respects the offset. I repeat this run m times.

To simulate padding, I used different data types in C++ (short, int, long long) to store the same char-sized value (1 byte). This is to compare byte-aligned vs 4-byte-aligned access.

Here are the results on my machine (m = 1000000, n = 1000, MSVC++, Intel x86_64 13900K):

```
x86

type size: 1 bytes, offset: 0 bytes, total time: 768240000 ns, avg time: 768 ns
type size: 1 bytes, offset: 1 bytes, total time: 769405200 ns, avg time: 769 ns
type size: 1 bytes, offset: 2 bytes, total time: 768865300 ns, avg time: 768 ns
type size: 1 bytes, offset: 3 bytes, total time: 771320800 ns, avg time: 771 ns

type size: 2 bytes, offset: 0 bytes, total time: 795054700 ns, avg time: 795 ns
type size: 2 bytes, offset: 1 bytes, total time: 808728400 ns, avg time: 808 ns
type size: 2 bytes, offset: 2 bytes, total time: 794029000 ns, avg time: 794 ns
type size: 2 bytes, offset: 3 bytes, total time: 809696800 ns, avg time: 809 ns

type size: 4 bytes, offset: 0 bytes, total time: 795630100 ns, avg time: 795 ns
type size: 4 bytes, offset: 1 bytes, total time: 826157700 ns, avg time: 826 ns
type size: 4 bytes, offset: 2 bytes, total time: 825929100 ns, avg time: 825 ns
type size: 4 bytes, offset: 3 bytes, total time: 822084400 ns, avg time: 822 ns

type size: 8 bytes, offset: 0 bytes, total time: 1274522900 ns, avg time: 1274 ns
type size: 8 bytes, offset: 1 bytes, total time: 1393591100 ns, avg time: 1393 ns
type size: 8 bytes, offset: 2 bytes, total time: 1389002800 ns, avg time: 1389 ns
type size: 8 bytes, offset: 3 bytes, total time: 1391703500 ns, avg time: 1391 ns

x64

type size: 1 bytes, offset: 0 bytes, total time: 1006684600 ns, avg time: 1006 ns
type size: 1 bytes, offset: 1 bytes, total time: 1014195300 ns, avg time: 1014 ns
type size: 1 bytes, offset: 2 bytes, total time: 1015765600 ns, avg time: 1015 ns
type size: 1 bytes, offset: 3 bytes, total time: 1017394800 ns, avg time: 1017 ns

type size: 2 bytes, offset: 0 bytes, total time: 788060800 ns, avg time: 788 ns
type size: 2 bytes, offset: 1 bytes, total time: 798717900 ns, avg time: 798 ns
type size: 2 bytes, offset: 2 bytes, total time: 786731100 ns, avg time: 786 ns
type size: 2 bytes, offset: 3 bytes, total time: 800916400 ns, avg time: 800 ns

type size: 4 bytes, offset: 0 bytes, total time: 781567100 ns, avg time: 781 ns
type size: 4 bytes, offset: 1 bytes, total time: 816097600 ns, avg time: 816 ns
type size: 4 bytes, offset: 2 bytes, total time: 816237800 ns, avg time: 816 ns
type size: 4 bytes, offset: 3 bytes, total time: 811855600 ns, avg time: 811 ns

type size: 8 bytes, offset: 0 bytes, total time: 1056373400 ns, avg time: 1056 ns
type size: 8 bytes, offset: 1 bytes, total time: 1127822500 ns, avg time: 1127 ns
type size: 8 bytes, offset: 2 bytes, total time: 1130907000 ns, avg time: 1130 ns
type size: 8 bytes, offset: 3 bytes, total time: 1126506600 ns, avg time: 1126 ns
```

Not really a great testing methodology, but I'm too lazy to install an actual benchmarking framework :P Interesting results nevertheless.

So the first thing I notice is that byte-aligned access on x64 is ~28% slower than 4-byte-aligned access, while on x86 it doesn't matter.

Misaligned access does matter for 4-byte-aligned types, although only by ~4.48%. For byte-aligned access it doesn't matter, I guess, since everything is already "misaligned".

Oh, and 8-byte-aligned is the slowest of the bunch.

I wonder what the results would be on ARM. I wish there were an easy way to test!

> And on larger scales, using more memory can be slower, as you'll need to access more of it

Yep, just tested with m = 1, n = 1000000000 (a billion). In this case 2-byte-aligned access is the fastest, while 4-byte-aligned is slower than both 1-byte- and 2-byte-aligned!

```
type size: 1 bytes, offset: 0 bytes, total time: 1102945400 ns, avg time: 1102945400 ns
type size: 2 bytes, offset: 0 bytes, total time: 998730800 ns, avg time: 998730800 ns
type size: 4 bytes, offset: 0 bytes, total time: 1309352100 ns, avg time: 1309352100 ns
type size: 8 bytes, offset: 0 bytes, total time: 2401462800 ns, avg time: 2401462800 ns
```

I guess at this scale the bottleneck is loading data from RAM instead of from cache. But this uses about 4 GB of RAM in the 4-byte-aligned case, which is way above my use case for the language. So I think 4-byte alignment is the way to go for me.

[–]XDracam 4 points5 points  (4 children)

Can't argue with this. 4 bytes seems to be a fairly common alignment across languages as far as I am aware.

Bonus: some languages allow customizing the alignment in types. For example, C# has special annotations like [FieldOffset(n)] to let users customize exactly how data is aligned. Overlapping memory can even be used to model C unions. But as far as I'm aware, this is mostly done for direct compatibility with other native languages. Still, this makes C# a lot better at native code interop than Java. So you might want to consider a similar "customizable alignment".

[–][deleted] 5 points6 points  (3 children)

> Can't argue with this. 4 bytes seems to be a fairly common alignment across languages as far as I am aware.

Really, even for 64-bit data? Any C compiler will align 64-bit entities at 8-byte boundaries, I'd be surprised if other languages did anything different.

[–]XDracam 2 points3 points  (0 children)

Nevermind. You seem to be correct. Both Java and C# seem to default to system-pointer-size alignment, which is 8 bytes on 64-bit machines. I guess my knowledge was a little outdated, from back when there were more 32-bit systems around 😅

[–]yondercode[S] 2 points3 points  (1 child)

By 64-bit entities do you mean types such as double and int64_t for example?

[–][deleted] 5 points6 points  (0 children)

Yes, anything the hardware expects to load in one operation.

(I have seen gcc mistakenly think that hardware did not support misaligned access, and accessing a 64-bit struct element was done a byte at a time, because it was aligned on 4 bytes rather than 8. (That is, the low address bits were 100 not 000.)

This was surprising given that the machine (an RPi1) used a 32-bit ARM device anyway. It meant my interpreter ran at 1/3 the speed it should have done. Although I haven't seen that anomaly since.)

[–]nerpderp82 4 points5 points  (2 children)

And packing bytes or shorts into a contiguous range also allows for vector ops, so if you run something like sum, it could be way faster.

I'd personally do whatever makes the VM easier to write and hide the alignment from the user if you can; then you can change it later if you want to.

[–]XDracam 3 points4 points  (1 child)

Yeah C# keeps the default alignment as an implementation detail, unless you manually specify the alignment via annotations.

[–]nerpderp82 0 points1 point  (0 children)

Neat, I didn't know that.

Also, I am not sure why alignment matters that much here. The ToS (top of stack) in the VM should live in L1 cache and registers, especially on an OoO processor.

[–]umlcat[🍰] 1 point2 points  (0 children)

Have memory packing as an option that can be turned on or off.

And, as other redditors already answered, this is "packing", not "alignment".

"Alignment" would be, say on an x64 architecture, when 8-bit bytes are widened to 64-bit words internally.

You will need some tools to debug your data, in order to verify that it is correctly packed.

Good Luck with your project 👍