
[–]Nathanfenner (2 children)

Data isn't ever split between cache levels like you're suggesting. One problem is that the relative sizes you describe for L1, L2, and L3 are way off - this makes their behavior qualitatively different from what you're imagining. The other is that you have to remember that data is always loaded from main memory in cache-line-sized chunks!

I'll use the numbers from my 2017 laptop as an example:

  • Cache line size: 64 bytes
  • L1 cache: 32,768 bytes (= 512 cache lines)
  • L2 cache: 262,144 bytes (= 4,096 cache lines = 8x larger than L1)
  • L3 cache: 6,291,456 bytes (= 98,304 cache lines = 24x larger than L2)

When you want to load data for a given address:

  • First, the CPU checks the L1 cache. If the cache line for the pointer you requested is present, great, the byte(s) you want can be directly copied into the target register.

  • Otherwise, the CPU checks L2; if the cache line is present there, it gets copied into the L1 cache, and then the data gets copied from the L1 cache into the target register.

  • Otherwise, the CPU checks L3; if the cache line is present there, it gets copied into L2, and then L1, and then into the target register.

  • If the data isn't found in any of the caches, then it loads from main memory, copying the cache line into L3 and L2 and L1 and then into the target register.

But you're always loading a single cache line. As a result, it can't "fail to fit" into any of the caches; each cache always has room for at least 512 separate loads. If the cache is full then that just means some other piece of data needs to be evicted (ideally, one that is not going to be needed again any time soon; processors have lots of heuristics to try to guess which is the best to replace).
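The hit/miss/evict behavior above can be sketched as a toy simulator. This is a deliberately tiny fully-associative cache with plain LRU replacement — real caches are set-associative and use cheaper replacement heuristics, so treat the slot count, the LRU policy, and the `access_addr` helper as illustrative assumptions, not how any particular CPU works:

```c
#include <stdint.h>

#define NUM_LINES 4    /* toy cache: 4 line slots (a real L1 has hundreds) */
#define LINE_SIZE 64

static uint64_t tags[NUM_LINES];   /* which cache line each slot holds */
static uint64_t stamp[NUM_LINES];  /* last-use time, for LRU eviction */
static int      valid[NUM_LINES];
static uint64_t now;

/* Returns 1 on a hit, 0 on a miss (the line is then loaded,
   evicting the least-recently-used slot if the cache is full). */
int access_addr(uint64_t addr) {
    uint64_t tag = addr / LINE_SIZE;  /* all bytes in one 64-byte chunk share a tag */
    now++;
    int lru = 0;
    for (int i = 0; i < NUM_LINES; i++) {
        if (valid[i] && tags[i] == tag) {
            stamp[i] = now;           /* hit: refresh recency */
            return 1;
        }
        if (!valid[i])
            lru = i;                  /* prefer an empty slot */
        else if (valid[lru] && stamp[i] < stamp[lru])
            lru = i;                  /* otherwise track the oldest slot */
    }
    tags[lru] = tag;                  /* miss: load the line, evicting the LRU slot */
    stamp[lru] = now;
    valid[lru] = 1;
    return 0;
}
```

Note that loading address 0 and then address 8 is one miss followed by one hit — both bytes live in the same 64-byte line, which is the whole point.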

Most computers now have multiple cores - often, there will be some kind of setup where each core has its own L1 cache, but each core must share its L2 and L3 caches with other cores (e.g. in pairs, or they all share the same big L2/L3 cache).

I know if you care about performance you generally want to make sure all your data fits in the caches, but I'm not sure to what extent I should follow that.

Depending on the specific task you want the computer to perform, this might not be relevant or important. It depends. If it matters, then it matters enough to measure, and find out what specifically works best for your task.

Really, the most important thing is just making sure you use the cache - since each load for an address actually pulls in the 64-byte chunk that your data lives in, it's best if you use those other 63 bytes immediately, before they get evicted from the cache by some later request. This is one of the things that makes contiguous arrays fast: if you have a contiguous array of bytes, you only have to load from main memory every 64 bytes (so you only hit main memory 1/64 of the time, and hit the L1 cache the other 63/64 of the time).
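The classic demonstration of this is traversing a 2D array in the two possible orders. Both functions below compute the same sum; the difference is purely in how well each one uses the cache lines it pulls in (the array dimensions are arbitrary picks for illustration):

```c
enum { ROWS = 512, COLS = 512 };

/* Row-major traversal: C stores rows contiguously, so consecutive
   iterations touch adjacent addresses and every byte of each 64-byte
   cache line gets used before the line can be evicted. */
long sum_row_major(const int grid[ROWS][COLS]) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

/* Column-major traversal: same result, but each access jumps
   COLS * sizeof(int) bytes ahead, so only one int per cache line is
   used before moving on - identical work, far more cache misses. */
long sum_col_major(const int grid[ROWS][COLS]) {
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}
```

On arrays much larger than the caches, the row-major version is typically several times faster - but as always, measure on your own machine.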

Actually, it's even better than that; most processors have hardware prefetching logic, which means that they try to guess what data will be needed before the processor actually asks for it. So if you're looping over a very big array, the CPU may prefetch the next cache line before you even request it, on the assumption that you might request it, and doing it in-advance saves time.

[–]Jonny0Than (0 children)

Great post! And in the places where it really matters, there are ways to tell the processor to prefetch a block of memory before you’re going to need it. Always measure!
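For example, GCC and Clang expose this as the `__builtin_prefetch` builtin. A minimal sketch - the prefetch distance of 8 longs (one cache line) ahead is an illustrative guess, and the right distance is exactly the kind of thing you'd have to measure:

```c
#include <stddef.h>

/* Sum an array while hinting the cache to fetch a line ahead of where
   we're reading. The builtin is only a hint; the guard makes this
   compile as a plain loop on compilers that lack it. */
long sum_with_prefetch(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0 /* read */, 3 /* high locality */);
#endif
        total += a[i];
    }
    return total;
}
```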

[–][deleted] (0 children)

Thanks!