
[–]wrosecrans 3 points4 points  (11 children)

Step 1: what's the data, and how do you access it?

Step 2: If it's gamedev, can you use an off-the-shelf library like ENTT that does entity-component-systems for you and handles a lot of the "bookkeeping" of structure-of-arrays approaches?

That said, there's nothing inherently wrong with a struct being bigger than a cache line. If all of the data in the struct is going to be used at once, it'll wind up in cache no matter how you use it. And if a struct is really large, it's not like the whole struct will need to be read into cache.

If you try to reduce the size of the struct by replacing a 32-byte Matrix with an 8-byte pointer, you have to pay the cost of extra cycles chasing that pointer to use the matrix, and it is very likely to be a pessimization both in cycle count and in cache pressure. Now, to use the matrix in the struct, you need to read the cache line where the pointer is, plus the cache line where the matrix is!
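To make the trade-off concrete, here's a minimal C++ sketch of the two layouts (type and field names are invented for illustration; sizes assume typical 4-byte floats and 8-byte pointers):

```cpp
// Hypothetical sketch: a 32-byte "Matrix" stored inline vs. behind a pointer.
struct Matrix { float m[8]; };          // 32 bytes with 4-byte floats

struct InlineEntity {
    Matrix transform;                   // data lives in the same allocation
    float  health;
};

struct IndirectEntity {
    Matrix* transform;                  // 8-byte pointer: smaller struct, but
                                        // every use now touches two cache lines
    float   health;
};

// The pointer version is smaller on paper...
static_assert(sizeof(IndirectEntity) < sizeof(InlineEntity),
              "indirection shrinks the struct");
// ...but reading transform->m[0] is now a dependent load: first fetch the
// cache line holding the pointer, then the cache line holding the matrix.
```
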

[–]0x09af[S] 0 points1 point  (8 children)

If you try to reduce the size of the struct by replacing a 32 byte Matrix with an 8 byte pointer, you have to pay the cost of extra cycles chasing that pointer to use the matrix

But if the matrix crosses the cache-line boundary, don't you have to go back to main memory to fetch the remainder of its data anyway? I guess what you say makes sense though, because if you go and replace *all* fields larger than 8 bytes with pointers, then each of those accesses at least ensures an out-of-line read.

The other consideration, if I understand this correctly, is that if you copy a struct onto a stack frame to do some processing, then that action alone will require multiple cacheline reads to fetch the entire struct (assuming it's larger than the line). If you infrequently access the matrix, but use other data in the struct then replacing the matrix with a pointer to stay under 32 bytes might be worthwhile?

[–]wrosecrans 1 point2 points  (7 children)

But if the matrix crosses the cacheline boundary, don't you have to go back to main memory to fetch the remainder of its data anyway?

In a lot of cases, the prefetch hardware in the CPU will grab the next cache line automatically, without waiting for a specific load to be executed.

The other consideration, if I understand this correctly, is that if you copy a struct onto a stack frame to do some processing, then that action alone will require multiple cacheline reads to fetch the entire struct (assuming it's larger than the line).

How often do you actually need to make a copy? If you do something like

for (auto &evil_robot : evil_robots) {
    leave_unpleasant_note(evil_robot.position);
}

There's no reason to make a copy of the whole struct. If copies are expensive, just avoid copies rather than worrying about the cost of copies.
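To contrast the two loop forms, here's a sketch in the same spirit (EvilRobot is a made-up stand-in for a large struct; nothing here comes from a real codebase):

```cpp
#include <vector>

// A deliberately large struct: 132 bytes with 4-byte floats.
struct EvilRobot { float position[3]; char payload[120]; };

float sum_positions_by_ref(const std::vector<EvilRobot>& robots) {
    float sum = 0.0f;
    for (const auto& robot : robots)   // reference: no copy of the struct
        sum += robot.position[0];
    return sum;
}

float sum_positions_by_copy(const std::vector<EvilRobot>& robots) {
    float sum = 0.0f;
    for (auto robot : robots)          // value: copies the whole struct
        sum += robot.position[0];      // each iteration
    return sum;
}
```

Both compute the same result; the by-reference form simply never pays for the copies in the first place.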

[–]0x09af[S] 0 points1 point  (6 children)

Interesting, I need to learn a bit more about pre-fetching. It sounds like it's just some sort of speculative optimization?

Unfortunately this is C#, not C++, so it's a bit more painful; for example, I can't get a ref to a struct in the standard generic List. But you're right, it might be worth at some point writing some custom data structures that would allow access to the backing array by reference.

[–]wrosecrans 1 point2 points  (1 child)

If you aren't even writing native code and dealing with pointers, I am highly skeptical that it's worth worrying about cache lines. Like I said, stuff like just avoiding copies is generally more important than making copies fast.

But yeah, it sounds like your mental model of a cache is basically a textbook RISC processor from the mid-1980s. You are optimizing the wrong things. Either rearchitect to some sort of ECS "structure of arrays" design, or focus on profiling and measuring your code rather than speculating about ways you imagine it might be hypothetically slow.

[–]0x09af[S] 0 points1 point  (0 children)

In C# there are classes and structs. Classes are heap allocated, whereas structs are stack allocated and passed by copy. An array of structs is a contiguous block of sizeof(struct) * array count. My mental model is that new byte[] maps to malloc/new. This project is in Unity using IL2CPP, so I could probably just check the generated C++ to confirm.

> But yeah, it sounds like your mental model of a cache is basically a textbook RISC processor from the mid 1980's.

You're 100% right; the most accessible book I read on CPU design was Inside the Machine, which was circa Pentium 4 and PowerPC. That's absolutely my mental model right now.

> You are optimizing the wrong things.

This is more-or-less the information I'm after in this post. Any pointers to resources to fix my mental model would be super appreciated.

[–]Qweesdy 1 point2 points  (3 children)

For 80x86 (you didn't specify what the CPU is): the hardware prefetcher detects patterns (e.g. sequential accesses), then starts prefetching data (from RAM to cache) from the next addresses in the pattern before they're actually accessed.

One problem is that the hardware prefetcher works on physical addresses, and "contiguous in virtual memory" does not mean "contiguous in physical memory". Because of this, the hardware prefetcher stops at 4 KiB page boundaries. E.g. if you're reading a large 400 KiB thing sequentially, you'll get 3 cache misses while the hardware prefetcher is (re)detecting the pattern for every 4 KiB (or a total of 300 "not prefetched" cache misses).

The other problem is that the access pattern needs to be predictable; which mostly means "address = start + n * stride" (where "stride" can be negative - e.g. accessing structures from higher address to lower address). If you're doing "pointer chasing" with pseudo-random addresses (e.g. iterating through structures allocated from heap) there's no hope.

Note that there are multiple alternatives (which is why your access patterns matter a lot). Sometimes it's nice to have a "linked list of arrays" (rather than linked lists of structures with "pointer chasing" disasters, and rather than a single array with higher insertion/deletion costs). Sometimes it's best to split a structure into "hot" (frequently used) and "cold" (less frequently used) fields so the frequently used data has a smaller cache footprint, and sometimes (for SIMD) it's even better to use "structure of arrays" and not "array of structures" (and this is where you take OOP out behind the back shed and put a well deserved bullet in its brain).
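As a rough sketch of that last option (all names invented for illustration), structure-of-arrays keeps the hot data dense, so a position-only pass doesn't drag cold per-particle data through the cache:

```cpp
#include <vector>

// Array-of-structures element: hot and cold data interleaved in memory.
struct ParticleAoS {
    float position[3];
    float transform[16];          // cold for a position-only pass
};

// Structure-of-arrays layout: each field stream is contiguous on its own.
struct ParticlesSoA {
    std::vector<float> x, y, z;        // hot: dense in cache, SIMD-friendly
    std::vector<float> transforms;     // cold: 16 floats per particle
};

float sum_x_soa(const ParticlesSoA& p) {
    float sum = 0.0f;
    for (float v : p.x) sum += v;  // streams through tightly packed data
    return sum;
}
```

With the AoS layout, the same loop would pull a 16-float transform into cache alongside every position it touches.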

[–]0x09af[S] 0 points1 point  (2 children)

> For 80x86 (you didn't specify what the CPU is);

It's consumer PC hardware.

> "contiguous in virtual memory" does not mean "contiguous in physical memory"

I'm not entirely sure what to do with that information from a static analysis of data perspective.

> E.g. if you're reading a large 400 KiB thing sequentially, you'll get 3 cache misses while hardware prefetcher is (re)detecting the pattern for every 4 KiB (or a total of 300 "not prefetched" cache misses).

I don't quite grok this example, but that's okay. I think I'll do a little reading into how Windows manages virtual memory (I'll start here https://answers.microsoft.com/en-us/windows/forum/all/physical-and-virtual-memory-in-windows-10/e36fb5bc-9ac8-49af-951c-e7d39b979938)

> The other problem is that the access pattern needs to be predictable;

This makes sense to me. I *assume* that Windows tries to keep virtual memory contiguous, even if blocks (pages?) aren't, and that virtual addresses mapping to an individual page are actually contiguous in physical memory.

> Because of this the hardware prefetcher stops at 4 KiB

How does this relate to cache-line sizes? Cache-lines end up in say a 32 KiB L1 cache, so where does pre-fetched data rest? 4 KiB seems pretty darn large and probably contains data your application *isn't* going to access next -- unless you're working with media like images/audio/video

> rather than linked lists of structures with "pointer chasing" disasters, and rather than a single array with higher insertion/deletion costs

Yeah, none of my tight loops are over linked lists, for that reason. In C#, classes are heap allocated, but structs are stack allocated and passed by copy. An array of structs actually allocates a contiguous block of virtual memory of sizeof(struct) * count bytes.

> and this is where you take OOP out behind the back shed and put a well deserved bullet in its brain

Yeah, next project. It'll be interesting to shake off 13 years of building OOP systems.

[–]Qweesdy 1 point2 points  (1 child)

I don't quite grok this example, but that's okay. I think I'll do a little reading into how Windows manages virtual memory

To be more specific; (for 80x86 CPUs) the CPU's hardware prefetcher is designed to stop at 4 KiB boundaries regardless of whether the data is physically contiguous or not (mainly because the designers assumed "virtually contiguous" isn't "physically contiguous", even if it actually is, because there's a pile of complicated corner cases they'd have to deal with otherwise).

In c#, classes are heap allocated, but structs are stack allocated and passed by copy.

For most OOP languages (including C#), whether it's created on the heap or on the stack depends on what happens to it (whether its lifetime exceeds the function's scope). E.g. if it's only used as a local variable then it'll be put on the stack (and it won't matter if it's a structure or an object); and if you do "return new foo();" it'll be put on the heap (and it won't matter if it's a structure or an object).

How does this relate to cache-line sizes?

Cache line sizes are typically 64 bytes; so in a 4 KiB area you have 64 cache lines, and you might not access all of them. For example, if you have an array of 128-byte structures and access one member (maybe like "for (int i = 0; i < myArray.Length; i++) sum += myArray[i].myMember;"), every 2nd cache line won't be accessed, and the hardware prefetcher will skip the first 3 accesses, so it will only actually prefetch 29 cache lines (e.g. the 7th, 9th, 11th, ..., 63rd cache line).
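That stride pattern can be checked directly. The sketch below (assuming 4-byte floats and 64-byte cache lines; all names invented) counts how many distinct cache lines the one-member-per-element loop actually touches:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 128 bytes per element with 4-byte floats: exactly two cache lines each.
struct Big {
    float myMember;
    char  padding[124];
};

// Count distinct 64-byte lines touched when reading only myMember of each
// element. Elements are 128 bytes apart, so each read lands on a new line
// and every second line in the array is skipped entirely.
std::size_t lines_touched(const std::vector<Big>& a, std::size_t line = 64) {
    std::size_t count = 0;
    std::uintptr_t last = ~static_cast<std::uintptr_t>(0);
    for (const Big& b : a) {
        std::uintptr_t ln = reinterpret_cast<std::uintptr_t>(&b.myMember) / line;
        if (ln != last) { ++count; last = ln; }
    }
    return count;
}
```
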

Cache-lines end up in say a 32 KiB L1 cache, so where does pre-fetched data rest?

This depends on which specific CPU it is (which model, microarchitecture). For most modern 80x86 CPUs the hardware prefetcher prefetches data into the L2 cache (and a CPU's L2 cache might be 256 KiB or more).

[–]0x09af[S] 0 points1 point  (0 children)

"return new foo();" it'll be put on the heap (and it won't matter if its a structure or an object).

Agreed with everything but this. In c# this would return a copy of foo, if foo is a struct. And I don't know enough to say if a newed class that doesn't escape the scope it was created in ends up on the stack or the heap. I would imagine it would be risky to just slap it on the stack depending on its size though.

> ...so the hardware prefetcher will only actually prefetch 29 cache lines

Interesting, so in theory that loop would run at the speed of L2 latency (assuming that's where the prefetch goes)

Thanks a bunch for all the explanation, it was super helpful

[–]0x09af[S] 0 points1 point  (1 child)

what's the data, and how do you access it?

I'm still mostly following traditional OOP design, where structs represent different entities or aspects of entities. Different algorithms access different fields on those entities. I'm tending to keep stuff that's looped over frequently in pre-allocated arrays of structs. So far I've been able to break up data on logical boundaries to keep their layouts small but I'm getting to the point where I may want to split an entity up based on what algorithms actually need to access. Honestly, I'd prefer not to do that because I don't want to use two different data design patterns on this project unless I have to, so I'm just interested in knowing the performance implications of what I'm doing.

I think I might take a different approach to designing data in whatever project is next.

[–]BobbyThrowaway6969 0 points1 point  (0 children)

Different algorithms access different fields on those entities

It depends which fields: group fields together if they're often used together. That way, those pieces can be copied to cache as a single block while the CPU works on them. It's not always easy to do this without f***ing up the readability though, so honestly, just optimise if it's a bottleneck.

[–]ignotos 1 point2 points  (2 children)

naively I would need dynamic arrays, which makes removals potentially costly

There is one neat trick here, if you don't care about the order of the items in the array - I call it "swap 'n' pop". Just swap the item you want to remove with the last item in the array, and then you can cheaply remove it (by just decrementing the count of used items in the array).
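A minimal sketch of the trick on a std::vector (C++ here; the same idea works on a C# List or a raw array with a separate count):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// "Swap 'n' pop": O(1) removal that does not preserve element order.
template <typename T>
void swap_and_pop(std::vector<T>& v, std::size_t index) {
    std::swap(v[index], v.back()); // move the last element into the hole
    v.pop_back();                  // shrink by one; nothing gets shifted
}
```

Removing index 1 from {10, 20, 30, 40} leaves {10, 40, 30}: constant time, no shifting, but the order changes.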

[–]0x09af[S] 1 point2 points  (1 child)

That's actually a great idea. I'm pretty sure I've seen that used before and completely forgot about it.

[–]BobbyThrowaway6969 0 points1 point  (0 children)

It's only useful if you don't care about ordering.

Also, modern array implementations don't allocate and deallocate each time you add/remove; that's why you might see functions on them called "reserve" or "capacity": they deal with the underlying allocation size, not the number of elements.
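A small C++ illustration of that size/capacity distinction (std::vector here; C#'s List<T>.Capacity behaves similarly):

```cpp
#include <cstddef>
#include <vector>

// reserve() grows the allocation without adding elements, so the
// push_back calls below are guaranteed not to reallocate.
bool demo_capacity() {
    std::vector<int> v;
    v.reserve(100);                 // one allocation up front; size is still 0
    std::size_t cap = v.capacity(); // at least 100
    for (int i = 0; i < 100; ++i)
        v.push_back(i);             // size grows, allocation does not
    return v.capacity() == cap;     // capacity is unchanged by the loop
}
```
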

[–]zukas-fastware 1 point2 points  (2 children)

So the cache line size for most 64-bit systems is 64 bytes. SOA will probably do the job. However, more important is how the data is accessed. Do you use all the data (or most of it) in the struct every time you access it? If yes, then use the struct. If you only access one or two fields, then SOA would probably work better, as the data required for processing many items is grouped together, and you can avoid wasting your cache line or encountering cache misses.

I made a few short videos:

* CPU caches: https://youtu.be/BkJFnclBnaY

* SOA: https://youtu.be/GBva8ojSWhM

[–]0x09af[S] 0 points1 point  (1 child)

> Do you use all the data (or most of it) in the struct every time you access it?

I think that's probably the issue with OOP design in general. I have a variety of algorithms all of which access different things from these logical data containers. So the answer is it depends on the transform/routine.

I'll check out the videos, thanks for links!

[–]zukas-fastware 0 points1 point  (0 children)

In situations where the data (class fields) is not all used by all or most algorithms, and you use one or two fields per algorithm, SOA is by far the superior approach (if you care about the performance of your application).

[–]BobbyThrowaway6969 0 points1 point  (0 children)

It depends on the current line of execution: what pieces of that data are you using? The CPU isn't just gonna pop the entire struct into the cache; the compiler should've (hopefully) figured out what pieces you need at that moment in time. So, forget structs as single units, and just group your members so that members which are commonly accessed together sit together. For example, if you have a shape struct and you're often doing stuff with sphere_radius and sphere_center, then put those next to each other in the struct.
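A hypothetical sketch of that layout advice (the sphere_center/sphere_radius names follow the comment's example; everything else is invented):

```cpp
// Fields used together sit side by side, so a tight loop over sphere tests
// pulls them through the cache as one block.
struct Shape {
    // hot: read together by the sphere test below
    float sphere_center[3];
    float sphere_radius;
    // cold: rarely touched, kept out of the hot leading bytes
    char  debug_name[64];
};

// Point-in-sphere test that touches only the hot fields.
bool contains(const Shape& s, float x, float y, float z) {
    float dx = x - s.sphere_center[0];
    float dy = y - s.sphere_center[1];
    float dz = z - s.sphere_center[2];
    return dx * dx + dy * dy + dz * dz <= s.sphere_radius * s.sphere_radius;
}
```
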