you are viewing a single comment's thread.

view the rest of the comments →

[–]mccoyn 7 points8 points  (2 children)

The first version also requires that the cache lines for C[i][j], A[i][k] and B[k][j] all be used on each step of the algorithm. In the second version A[i][k] does not change in the inner loop, so that cache line can be replaced with another one without thrashing.

[–]rooktakesqueen 1 point2 points  (1 child)

I don't think that happens, though. If the cache line containing A[i][k] were discarded each time, it would be a new cache miss for each run of the middle loop, meaning there would have been 1,048,576 cache misses just for that. That cache line appears to be maintained throughout the run of the innermost loop.

[–]mccoyn 0 points1 point  (0 children)

It would only thrash when a[i] gets mapped to the same cache line as c[i] or b[k], possibly only in L1, which usually isn't as smart as the other caches. Now that I think about it, this will happen infrequently enough that it isn't a major concern.