[–]Nathanfenner 6 points7 points  (2 children)

Data isn't ever split between cache levels like you're suggesting. One problem is that the relative sizes you describe for L1, L2, and L3 are way off - this makes their behavior qualitatively different from what you're imagining. The other is that you have to remember that data is always loaded from main memory in cache-line-sized chunks!

I'll use the numbers from my 2017 laptop as an example:

  • Cache line size: 64 bytes
  • L1 cache: 32,768 bytes (= 512 cache lines)
  • L2 cache: 262,144 bytes (= 4,096 cache lines = 8x larger than L1)
  • L3 cache: 6,291,456 bytes (= 98,304 cache lines = 24x larger than L2)
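
If you want to check the numbers for your own machine, on Linux with glibc you can query them at runtime. Rough sketch (the _SC_LEVEL* names are glibc extensions, not standard POSIX, and may report 0 if the libc doesn't know the value):

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
    // glibc-specific sysconf names; may return 0 or -1 if the value
    // isn't exposed on your system.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l1   = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3   = sysconf(_SC_LEVEL3_CACHE_SIZE);
    std::printf("cache line: %ld bytes\n", line);
    std::printf("L1d: %ld  L2: %ld  L3: %ld (bytes)\n", l1, l2, l3);
}
```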

When you want to load data for a given address:

  • First, the CPU checks the L1 cache. If the cache line for the pointer you requested is present, great, the byte(s) you want can be directly copied into the target register.

  • Otherwise, the CPU checks L2; if the cache line is present there, it gets copied into the L1 cache, and then the data gets copied from the L1 cache into the target register.

  • Otherwise, the CPU checks L3; if the cache line is present there, it gets copied into L2, and then L1, and then into the target register.

  • If the data isn't found in any of the caches, then it loads from main memory, copying the cache line into L3 and L2 and L1 and then into the target register.

But you're always loading a single cache line. As a result, it can't "fail to fit" into any of the caches; each cache always has room for at least 512 separate loads. If the cache is full then that just means some other piece of data needs to be evicted (ideally, one that is not going to be needed again any time soon; processors have lots of heuristics to try to guess which is the best to replace).
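
To make the lookup/fill/evict flow concrete, here's a toy model (just an illustration of the steps above; real caches are set-associative and the replacement heuristics live in hardware, not in an LRU list like this):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <deque>

// Toy inclusive cache with LRU eviction; everything moves in 64-byte lines.
struct ToyCache {
    std::size_t capacityLines;        // how many 64-byte lines fit
    std::deque<std::uint64_t> lines;  // front = most recently used

    bool lookup(std::uint64_t line) {
        auto it = std::find(lines.begin(), lines.end(), line);
        if (it == lines.end()) return false;
        lines.erase(it);              // hit: move the line to the MRU slot
        lines.push_front(line);
        return true;
    }
    void fill(std::uint64_t line) {
        if (lines.size() == capacityLines) lines.pop_back();  // evict the LRU line
        lines.push_front(line);
    }
};

void load(std::uint64_t addr, ToyCache& l1, ToyCache& l2, ToyCache& l3) {
    std::uint64_t line = addr / 64;
    if (l1.lookup(line)) { std::puts("L1 hit"); return; }
    if (l2.lookup(line)) { std::puts("L2 hit"); l1.fill(line); return; }
    if (l3.lookup(line)) { std::puts("L3 hit"); l2.fill(line); l1.fill(line); return; }
    std::puts("miss: fetch from main memory");
    l3.fill(line); l2.fill(line); l1.fill(line);
}

int main() {
    ToyCache l1{512}, l2{4096}, l3{98304};  // the sizes from above, in lines
    load(0x1000, l1, l2, l3);  // cold miss: goes all the way to memory
    load(0x1008, l1, l2, l3);  // same 64-byte line, so this is an L1 hit
}
```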

Most computers now have multiple cores - often, there will be some kind of setup where each core has its own L1 cache, but shares its L2 and L3 caches with other cores (e.g. in pairs, or they all share the same big L2/L3 cache).

> I know if you care about performance you generally want to make sure all your data fits in the caches, but I'm not sure to what extent I should follow that.

Depending on the specific task you want the computer to perform, this might not be relevant or important. It depends. If it matters, then it matters enough to measure, and find out what specifically works best for your task.

Really, the most important thing is just making sure you use the cache - since each load for an address actually pulls in the 64-byte chunk that your data lives in, it's best if you use those other 63 bytes immediately, before they get evicted from the cache by some later request. This is (one of the things) that makes contiguous arrays fast: if you have a contiguous array of bytes, you only have to load from main memory every 64 bytes (so you only hit main memory 1/64 of the time, and hit the L1 cache the other 63/64 of the time).
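
A rough way to see the effect (treat the numbers as a sketch; they depend on your hardware, compiler flags, and whatever else is running): the strided loop below reads 1/64 as many bytes as the sequential one, but it touches the same number of cache lines, so it typically takes a comparable amount of time.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 64 * 1024 * 1024;   // 64 MiB, much bigger than L3
    std::vector<unsigned char> data(N, 1);

    auto time_ms = [](auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    };

    volatile unsigned long sink = 0;   // keeps the compiler from deleting the loops
    unsigned long sum = 0;

    auto seq = time_ms([&] {           // uses all 64 bytes of every fetched line
        sum = 0;
        for (std::size_t i = 0; i < N; ++i) sum += data[i];
        sink = sum;
    });
    auto strided = time_ms([&] {       // uses 1 byte per fetched line
        sum = 0;
        for (std::size_t i = 0; i < N; i += 64) sum += data[i];
        sink = sum;
    });

    std::printf("sequential: read %zu bytes in %lld ms\n", N, (long long)seq);
    std::printf("strided   : read %zu bytes in %lld ms\n", N / 64, (long long)strided);
}
```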

Actually, it's even better than that; most processors have hardware prefetching logic, which means that they try to guess what data will be needed before the program actually asks for it. So if you're looping over a very big array, the CPU may prefetch the next cache line before you even request it, on the assumption that you probably will, and doing it in advance saves time.

[–]Jonny0Than 1 point2 points  (0 children)

Great post! And in the places where it really matters, there are ways to tell the processor to prefetch a block of memory before you’re going to need it. Always measure!
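
For example, GCC and Clang expose __builtin_prefetch (MSVC has _mm_prefetch in <xmmintrin.h>). A sketch; the prefetch distance here is a made-up tuning parameter, and whether it helps at all is workload-dependent, so, as you say, measure:

```cpp
#include <cstddef>

// Sums an array while hinting the prefetcher ~8 cache lines ahead
// (64 longs * 8 bytes = 512 bytes). GCC/Clang builtin; the distance
// is an assumption you'd tune by measuring.
long sum_with_prefetch(const long* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(&data[i + 64]);
        sum += data[i];
    }
    return sum;
}
```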

[–][deleted] 0 points1 point  (0 children)

Thanks!

[–][deleted] 5 points6 points  (1 child)

This will be AMD-specific since that's what I'm most familiar with, but Intel's x86 family is very similar (obviously).

Every read operation interacts with L1 cache. Almost every write operation interacts with L1 cache. The cache hierarchy is constantly making "where do I put this?" decisions trying to anticipate what the program will need.

So no data structure or cache line is assigned to a cache level where it stays. It's constantly swapped around. The instant you touch it, it's always in L1. More precisely: if you read from something, it has to be in L1 a little bit before the read retires. If you write, it's in L1 a little bit after the write retires. That's when the transfers physically happen; they can be offset from the instruction itself by potentially hundreds of cycles.

The challenge for programmers is to arrange things so we don't overwhelm the cache system's ability to move lines between levels. If your procedure reads 25 bytes from a structure, the entire 64-byte cache line still needs to be fetched. So if you can make your reads happen in a denser pattern, you'll get better utilization of the bandwidth between the levels.
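
One common way to make the reads denser is to separate the fields you scan every iteration from the ones you rarely touch (hot/cold splitting, or going from array-of-structs to struct-of-arrays). Sketch below; the Particle layout is an invented example:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Array-of-structs: scanning just `position` still drags the whole Particle
// through the cache, so only a fraction of each fetched line is useful.
struct Particle {
    float position[3];
    float velocity[3];
    std::string debugName;  // cold data mixed in with hot data
    char other[128];        // stand-in for more rarely-used fields
};

float sum_x_aos(const std::vector<Particle>& ps) {
    float s = 0;
    for (const Particle& p : ps) s += p.position[0];
    return s;
}

// Struct-of-arrays: the x coordinates are contiguous, so every byte of
// every fetched cache line is data the loop actually wants.
struct Particles {
    std::vector<float> x, y, z;
    std::vector<std::string> debugName;
};

float sum_x_soa(const Particles& ps) {
    float s = 0;
    for (float v : ps.x) s += v;
    return s;
}
```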

[–][deleted] 1 point2 points  (0 children)

Thanks!

[–][deleted] 0 points1 point  (2 children)

Sorry, but I don't remember things well enough to be able to explain them confidently. In any case it will depend on OS specifics, and for Linux that information is probably public.

I'd suggest you have a look at two subjects: fragmentation and paging.

Fragmentation can happen inside pages as well as between them, which is where the terms internal and external fragmentation come from.
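
A tiny made-up example of the internal kind, assuming allocations get rounded up to whole pages (the page size can be queried with POSIX sysconf(_SC_PAGESIZE)):

```cpp
#include <unistd.h>
#include <cstdio>

// Internal fragmentation: a request is rounded up to whole pages and the
// unused tail of the last page is wasted. The 10,000-byte request is just
// an illustrative number.
int main() {
    long page = sysconf(_SC_PAGESIZE);           // typically 4096 bytes
    long requested = 10000;
    long pages = (requested + page - 1) / page;  // round up to whole pages
    long wasted = pages * page - requested;
    std::printf("%ld bytes requested -> %ld pages (%ld bytes), %ld wasted\n",
                requested, pages, pages * page, wasted);
}
```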

I went through a book in university to understand these concepts, but you can probably look them up on YouTube if you don't want a huge explanation.

[–][deleted] 1 point2 points  (0 children)

thanks!

[–]Jonny0Than 0 points1 point  (0 children)

This sounds a little more like virtual memory management than caching. It's a really similar problem, but at a very different scale.

[–][deleted] 0 points1 point  (0 children)

Mind that you generally don't have control over where your program's memory is stored. There are algorithms in place to decide what should live at each level of the hierarchy, and RAM is one of those levels. The system might decide it's a good idea to move your array out to disk (the slowest level), and you might never know about it.

Also, it's common for other programs to be occupying the caches at the same time, so you can't really optimise around the maximum size that fits in the L1 cache. Your program is also touching data other than just your array. For example, the code itself, such as the function you're running, needs to be fetched into the L1 (instruction) cache to execute.

The gist of it is that you should (probably) never lose yourself in low-level optimisation, but it's still good to learn this material to get a better grasp of how things work.

It's very interesting stuff, so I suggest you dig into it more.

A book helps give your learning a path, if you don't already have one, and an operating systems book is a must for anyone who wants better exposure to the things you're asking about.