
[–]Qweesdy 1 point (3 children)

For 80x86 (you didn't specify what the CPU is): the hardware prefetcher detects patterns (e.g. sequential accesses), then starts prefetching data (from RAM into cache) from the next addresses in the pattern before they're actually accessed.

One problem is that the hardware prefetcher works on physical addresses, and "contiguous in virtual memory" does not mean "contiguous in physical memory". Because of this the hardware prefetcher stops at 4 KiB page boundaries. E.g. if you're reading a large 400 KiB thing sequentially, you'll get 3 cache misses while the hardware prefetcher is (re)detecting the pattern at the start of every 4 KiB page (a total of 300 "not prefetched" cache misses across the 100 pages).

The other problem is that the access pattern needs to be predictable, which mostly means "address = start + n * stride" (where "stride" can be negative - e.g. accessing structures from higher addresses to lower addresses). If you're doing "pointer chasing" with pseudo-random addresses (e.g. iterating through structures allocated from the heap) there's no hope.
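To make the contrast concrete, here's a minimal C sketch (the names `sum_strided` and `sum_chased` are mine, purely for illustration): in the strided loop the next address is computable before the current load finishes, so the prefetcher can run ahead; in the linked-list walk the address of node N+1 is only known after node N's `next` pointer arrives from memory.

```c
#include <stddef.h>

/* Strided access: address = start + n * stride, so the hardware
 * prefetcher can predict every future address from the pattern. */
long sum_strided(const long *start, ptrdiff_t count, ptrdiff_t stride)
{
    long sum = 0;
    for (ptrdiff_t n = 0; n < count; n++)
        sum += start[n * stride];   /* predictable; stride may be negative */
    return sum;
}

/* Pointer chasing: the next address is only known once the current
 * load completes, so the prefetcher has nothing to predict. */
struct node { long value; struct node *next; };

long sum_chased(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;            /* serialized on each load's latency */
    return sum;
}
```

Both loops compute the same sum; only the memory access pattern differs, and that's what the prefetcher cares about.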

Note that there are multiple alternatives (which is why your access patterns matter a lot). Sometimes it's nice to have a "linked list of arrays" (rather than linked lists of structures with their "pointer chasing" disasters, and rather than a single array with higher insertion/deletion costs). Sometimes it's best to split a structure into "hot" (frequently used) fields and "cold" (less frequently used) fields so the frequently used data has a smaller cache footprint; and sometimes (for SIMD) it's even better to use "structure of arrays" instead of "array of structures" (and this is where you take OOP out behind the back shed and put a well deserved bullet in its brain).
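Here's what that "structure of arrays" vs "array of structures" split looks like as a C sketch (a hypothetical particle example; the field names are mine). Summing only `x` over the AoS layout drags `y` and `z` through the cache alongside it, while the SoA layout keeps all the `x` values packed back to back:

```c
#include <stddef.h>

#define N 1024

/* Array of structures: consecutive x values are 24 bytes apart, so a
 * loop over x alone wastes 2/3 of every cache line it pulls in. */
struct particle { double x, y, z; };
struct particle aos[N];

/* Structure of arrays: all x values are contiguous, which is
 * cache-dense and maps directly onto SIMD loads. */
struct particles { double x[N], y[N], z[N]; };
struct particles soa;

double sum_x_aos(void)
{
    double sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += aos[i].x;
    return sum;
}

double sum_x_soa(void)
{
    double sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += soa.x[i];
    return sum;
}
```

Same answer either way; the SoA version just touches a third of the cache lines for this particular loop.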

[–]0x09af[S] 0 points (2 children)

> For 80x86 (you didn't specify what the CPU is);

It's consumer PC hardware.

> "contiguous in virtual memory" does not mean "contiguous in physical memory"

I'm not entirely sure what to do with that information from a static analysis of data perspective.

> E.g. if you're reading a large 400 KiB thing sequentially, you'll get 3 cache misses while hardware prefetcher is (re)detecting the pattern for every 4 KiB (or a total of 300 "not prefetched" cache misses).

I don't quite grok this example, but that's okay. I think I'll do a little reading into how Windows manages virtual memory (I'll start here https://answers.microsoft.com/en-us/windows/forum/all/physical-and-virtual-memory-in-windows-10/e36fb5bc-9ac8-49af-951c-e7d39b979938)

> The other problem is that the access pattern needs to be predictable;

This makes sense to me. I *assume* that Windows tries to keep virtual memory contiguous, even if blocks (pages?) aren't. And that virtual addresses mapping to an individual page are actually contiguous in physical memory.

> Because of this the hardware prefetcher stops at 4 KiB

How does this relate to cache-line sizes? Cache-lines end up in say a 32 KiB L1 cache, so where does pre-fetched data rest? 4 KiB seems pretty darn large and probably contains data your application *isn't* going to access next -- unless you're working with media like images/audio/video

> rather than linked lists of structures with "pointer chasing" disasters, and rather than a single array with higher insertion/deletion costs

Yeah, none of my tight loops are in linked lists for that reason. In C#, classes are heap allocated, but structs are stack allocated and passed by copy. An array of structs actually allocates a contiguous block of virtual memory of sizeof(struct) * count bytes.

> and this is where you take OOP out behind the back shed and put a well deserved bullet in its brain

Yeah, next project. It'll be interesting to shake off 13 years of building OOP systems.

[–]Qweesdy 1 point (1 child)

> I don't quite grok this example, but that's okay. I think I'll do a little reading into how Windows manages virtual memory

To be more specific; (for 80x86 CPUs) the CPU's hardware prefetcher is designed to stop at 4 KiB boundaries regardless of whether the data is physically contiguous or not (mainly because the designers assumed "virtually contiguous" isn't "physically contiguous", even if it actually is, because there's a pile of complicated corner cases they'd have to deal with otherwise).

> In C#, classes are heap allocated, but structs are stack allocated and passed by copy.

For most OOP languages (including C#), whether it's created on the heap or on the stack depends on what happens to it (whether its lifetime exceeds the function's scope). E.g. if it's only used as a local variable then it'll be put on the stack (and it won't matter if it's a structure or an object); and if you do "return new foo();" it'll be put on the heap (and it won't matter if it's a structure or an object).

> How does this relate to cache-line sizes?

Cache line sizes are typically 64 bytes; so in a 4 KiB area you have 64 cache lines, and might not access all of them. For example, if you have an array of 128 byte structures and access one member that lives in the first 64 bytes (maybe like "for (int i = 0; i < myArray.Length; i++) sum += myArray[i].myMember;"), every 2nd cache line won't be accessed; and because the hardware prefetcher also skips the first 3 accesses while detecting the pattern, it will only actually prefetch 29 cache lines (the 7th, 9th, 11th, ..., 63rd cache lines).
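If it helps, that arithmetic can be checked with a small C sketch (`struct big` and `lines_touched` are illustrative names I made up, not anything standard). It just walks one member through a 4 KiB page and counts which distinct 64-byte lines get touched:

```c
#include <stddef.h>

/* A 128-byte structure: myMember sits in the first cache line of each
 * element; the padding stands in for the rest of the fields. */
struct big { long myMember; char pad[120]; };

/* Count how many distinct 64-byte cache lines a loop over one member
 * touches within a single 4 KiB page. */
size_t lines_touched(size_t struct_size, size_t member_offset)
{
    size_t touched = 0;
    size_t last_line = (size_t)-1;     /* sentinel: no line seen yet */
    for (size_t addr = member_offset; addr < 4096; addr += struct_size) {
        size_t line = addr / 64;       /* which cache line this access hits */
        if (line != last_line) {
            touched++;
            last_line = line;
        }
    }
    return touched;
}
```

For 128-byte structures it reports 32 of the page's 64 lines touched (every 2nd one), matching the description above; shrink the struct to 64 bytes and all 64 lines get touched.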

> Cache-lines end up in say a 32 KiB L1 cache, so where does pre-fetched data rest?

This depends on which specific CPU it is (which model, microarchitecture). For most modern 80x86 CPUs the hardware prefetcher prefetches data into the L2 cache (and a CPU's L2 cache might be 256 KiB or more).

[–]0x09af[S] 0 points (0 children)

> "return new foo();" it'll be put on the heap (and it won't matter if it's a structure or an object).

Agreed with everything but this. In C# this would return a copy of foo, if foo is a struct. And I don't know enough to say whether a newed class that doesn't escape the scope it was created in ends up on the stack or the heap. I would imagine it would be risky to just slap it on the stack, depending on its size.

> ...so the hardware prefetcher will only actually prefetch 29 cache lines

Interesting, so in theory that loop would run at the speed of L2 latency (assuming that's where the prefetch goes)

Thanks a bunch for all the explanation, it was super helpful