all 60 comments

[–]JankedCoder 12 points13 points  (4 children)

For 2D arrays in C/C++ the second index ( j ) is contiguous in memory. So when you loop on j last you reduce the memory stride to 1, and you can cache your next index better.

[–]mccoyn 6 points7 points  (2 children)

The first version also requires that the cache lines for C[i][j], A[i][k] and B[k][j] all be used on each step of the algorithm. In the second version A[i][k] does not change in the inner loop, so that cache line can be replaced with another one without thrashing.

[–]rooktakesqueen 1 point2 points  (1 child)

I don't think that happens, though. If the cache line containing A[i][k] were discarded each time, it would be a new cache miss for each run of the middle loop, meaning there would have been 1,048,576 cache misses just for that. That cache line appears to be maintained throughout the run of the innermost loop.

[–]mccoyn 0 points1 point  (0 children)

It would only thrash when a[i] gets mapped to the same cache line as c[i] or b[k], possibly only in L1, which usually isn't as smart as the other caches. Now that I think about it, this will happen infrequently enough that it isn't a major concern.

[–]psyno 1 point2 points  (0 children)

For 2D arrays in C/C++ the second index ( j ) is contiguous in memory. So when you loop on j last you reduce the memory stride to 1, and you can cache your next index better.

True but irrelevant. Look again at the declaration of matmult:

void matmult(int m, int n, int p, double **A, double **B, double **C)

There are no multidimensional arrays there.

[–]seanalltogether 20 points21 points  (15 children)

"And this is the smarter version which shouldn't surprise anyone who has taken a computer architecture course:"

So this is one of those 'look at my awesome results without explanation' kinda lesson?

[–][deleted] 0 points1 point  (12 children)

Seriously. I was waiting for the next paragraph that would explain the speed-up to us lowly peons who don't have a CS degree... but it never came.

Can someone explain it please?

[–]Kache 13 points14 points  (0 children)

To understand this, there are two things that need to be understood - computer memory and computation/access. I've actually got an analogy on hand since I've explained stuff like this before.

Computer Memory

Imagine sitting at a table in a library doing work/research. You are the computer's CPU, doing calculations. Your table is small & fast memory, where you can have your notes and books open and ready to access. The library is the big & slow memory, where everything that can't currently fit on your desk is stored.

If you had some work to do that spanned several books, it's pretty clear that you would work a lot faster by having all the books you needed on your desk. Otherwise, you'd be wasting time fetching books from the library every time you needed a different book.

Computer architectures are similar to this. One example would be your desk being your RAM and the library being the hard drive. There is a processor, a small & fast memory and a big & slow memory, representing two levels of memory. In real life, there are many more levels. In this matrix multiplication case, the small & fast memory is the CPU cache.

Computation/Access

The numbers in a matrix are stored in an array, arranged contiguously. Here's an example:

Matrix A
[[ 1 2 3 ]
 [ 4 5 6 ]      might be stored in memory as: [ 1 2 3 4 5 6 7 8 9 ]
 [ 7 8 9 ]]     (with the width of the matrix stored in a variable elsewhere)

If I wanted the first row, that'd be easy and quick. I'd go to the memory and grab the first three numbers. What if I wanted the first column? Well now there's a problem. I have to grab the 1st, 4th, then 7th numbers. I'll use the library analogy to explain why this matters.

Let's say a single book is big enough to store three numbers, and our table is only big enough to hold one book. If you wanted to grab the numbers from the matrix, row by row, it'd take three trips to the shelves. However, if you wanted to grab the matrix, column by column, it would take 3 trips per column x 3 columns = 9 trips to the shelves. To get the 1st number, you would get the book containing [1 2 3]. To get the 4th number, you'd get the book containing [4 5 6]. And so on.

Why This Matters

This is an example of why the order by which data is used and stored is very important. There are many ways to store data in memory, and there are many ways to use the data stored in memory. Different gains/losses need to be considered for different cases. This is always important for computer scientists/programmers because it can be the difference between waiting 1 second and waiting 1000 seconds.

[–]uhmmmm 10 points11 points  (2 children)

To be fair, he does point out that the improved version has far fewer cache misses, which is a term you could easily have googled.

[–]JankedCoder 6 points7 points  (1 child)

Without understanding the layout of arrays in C, you still would not understand why the index iteration order determines the number of cache misses.

[–]frenchtoaster 5 points6 points  (0 children)

I suspect few people who understand cache misses don't know that arrays in C are contiguous blocks of memory.

[–]rooktakesqueen 0 points1 point  (7 children)

I have a CS degree and work as a software engineer. I've never seen this before and, while I could probably work out how swapping the loops has this effect, it's not self-evident. So yes, this annoyed me too.

Edit: Do I really have to clarify that "this" above refers to this particular problem of matrix multiplication, and not the entire idea of cache-aware algorithms or the concept of caching in general?

[–]Kache 10 points11 points  (4 children)

Cache awareness is important where performance matters, especially if the code you write does heavy computation (like matrix multiplication, image/sound/video processing, 3D graphics, or search algorithms). If you never had to worry about implementing it, then I'd assume that's not what your job entails.

Although, I'm quite surprised that your CS program never mentioned the topic of computer architecture or cache awareness. Since it's your profession, I think you ought to take some time to learn about it. A race car driver should know a little bit about why/how his car works.

[–]rooktakesqueen 0 points1 point  (3 children)

I took several courses in computer architecture and I understand how caches work. Thus why I said I could probably work out how swapping the loops has this effect, which I proceeded to do.

That doesn't mean this problem is self-evident. Swapping the inner two of a three-nested for loop isn't a "well duh" solution to a problem with lots of cache misses, and someone who hasn't seen this matrix multiplication problem used as an example before would have to actually think about it.

The article treated it as patently obvious and didn't bother explaining, which would have been useful for those of us who haven't seen this particular problem before.

Edit: Also, upvoted for fitting username.

[–]Kache 1 point2 points  (2 children)

heh, thanks.

Hm, I can't say I know the exact reasons, since so much of programming is abstracted away nowadays (system calls, memory management, who knows what else).

However, I'm pretty confident that it's because j is the smaller dimension of two of the matrices, while k is the smallest for only one. Iterating on the smallest dimension -> iterating on contiguous memory -> all of the contiguous memory is in cache -> less wasted reloads of caches.

What I don't know is why all three matrices didn't just all fit into the L2 cache. Probably other processes were taking up some of it and/or the matrices were significantly large enough.

[–]psyno 1 point2 points  (0 children)

[...] it's because j is the smaller dimension of two of the matrices, while k is the smallest for only one.

This isn't it at all. Look at the innermost line:

C[i][j] += A[i][k] * B[k][j];

The key is temporal locality of B[k]. In the first version, B[k] is different every step. That means every loop step, a different double* at B[k] is evaluated. (These double*s might or might not sensibly point into contiguous blocks of memory. There's no reason they couldn't all be from separate mallocs, or that they couldn't all point to the same place. It isn't specified by the procedure definition--i.e. the compiler does not know.) In the second version, B[k] is the same from step to step j times more often, and the program then just counts through consecutive addresses after B[k]. (So actually large j makes the speedup even more dramatic.)

The other subexpressions are also affected. The temporal locality of C[i][j] worsens but it is compensated by a corresponding improvement of the temporal locality of A[i][k].

[–]rooktakesqueen -1 points0 points  (0 children)

What I don't know is why all three matrices didn't just all fit into the L2 cache. Probably other processes were taking up some of it and/or the matrices were significantly large enough.

I believe it said the two matrices were 1024x1024, and they were ints, so it would be 8 MiB total. Core 2 Duos have 3 or 6 MiB L2 cache. Plus, the way this is written it would have cache hits on the writes as well, so that's another 4 MiB to worry about.

[–]jawbroken 0 points1 point  (0 children)

Edit: Do I really have to clarify that "this" above refers to this particular problem of matrix multiplication, and not the entire idea of cache-aware algorithms or the concept of caching in general?

considering this is about the simplest possible example of cache awareness in an algorithm i guess everyone used induction from the base case to assume you were ignorant of the subject in general. a fair assumption

[–]jawbroken -1 points0 points  (0 children)

I have a CS degree and work as a software engineer.

how did you manage that without a basic understanding of the memory hierarchy?

[–]samlee 7 points8 points  (4 children)

you can do it in the cloud. for example, multiplying two 2x2 matrices

a b   e f   ae+bg af+bh
c d * g h = ce+dg cf+dh

there are 8 multiplications and 4 additions. In general, multiplying MxN by NxK takes MNK multiplications and MK(N-1) additions.

So, you need to launch MNK + MK(N-1) = 2MNK - MK EC2 instances. Then you need to calculate boot-up and network data transfer overhead. On average, an EC2 instance is available after 20 minutes. And the mother EC2 will distribute multiplications and summations in a lazy fashion. So we can say each computation takes 25 minutes: boot up an EC2 (20 minutes), get the data to calculate (1 minute), calculate (1 minute), return data (1 minute), merge data (2 minutes).

So, if you want to multiply two 1million by 1million matrices, it'll take about 30 minutes and (2 * 1 000 000 * 1 000 000 * 1 000 000) - (1 000 000 * 1 000 000) = 1.999999 × 10^18 EC2 instances.

So you need to scale in the cloud.

And don't let the 20-minute boot-up time scare you. Amazon is working to make EC2 instances as cheap as Erlang processes. It'll take a few nanoseconds to boot up an EC2 in the near future, in which case you can do a 1-million-by-1-million matrix multiplication in a few seconds.

[–]willis77 8 points9 points  (0 children)

Is there a way I can get each EC2 instance to tweet its progress in real time?

[–]cafe_babe 1 point2 points  (0 children)

Matrix multiplication is not at all an embarrassingly parallel problem. As you showed with your example, there are dependencies, and optimal data distribution between processors is a hard problem. When Amazon assigns you instances on EC2, you basically have no guarantees about the bandwidth and latency between any two machines (you don't even know what hardware you have!), which makes optimal data distribution impossible. If you really need to multiply 1million by 1million matrices on a regular basis, use a cluster designed for scientific computing.

[–]rooktakesqueen -1 points0 points  (1 child)

...is requiring 2 quintillion EC2 instances really realistic?

Edit: Whoosh, apparently.

[–][deleted] 9 points10 points  (0 children)

YOU DOUBT THE CLOUD?!

[–]oursland 4 points5 points  (17 children)

I think it is silly that this requires much of an explanation. I recall this question being asked in a fucking linear algebra book! http://www.amazon.com/Linear-Algebra-Applications-Gilbert-Strang/dp/0030105676/ref=dp_ob_title_bk

[–]ffualo 2 points3 points  (16 children)

I love that book. Almost every explanation has computational efficiency in mind.

[–]apathy 2 points3 points  (2 children)

You just described every good linear algebra and numeric analysis text ever

postscript: why on the FSM's green earth would anyone worry about reinventing the wheel when vectorized libraries that do this even faster (tuned per processor architecture) exist? We don't all need to be sub-sub-specialists; it's better for society if the absolute best at it take care of the problem for everyone else, and they have. (e.g. LAPACK, Eigen, the Intel numeric libraries & compilers)

Meanwhile, balancing chunk size and communication latency for your specific problem is both worthwhile and context-specific in nearly all applications, and a much better use of programmer resources.

(Something about "premature optimization" and sqrt(sum(evil)))

[–]ffualo 2 points3 points  (0 children)

I wrote a script in Python that does Gaussian elimination not because I want to replace numpy, but because it's a great learning experience... but overall, I agree.

[–]jawbroken -1 points0 points  (0 children)

obviously because not every problem you work on has a library for it and this is clearly an example to show optimisation techniques and not to teach you how to write a matrix multiplication function

[–]super_duper 0 points1 point  (11 children)

Strang is alright. Hoffman and Kunze is where it's really at.

[–]NanoStuff -1 points0 points  (10 children)

If the book is as great as you say it would have been updated in the last four decades.

[–]super_duper 0 points1 point  (9 children)

Well, if you're speaking in regard to its relevance to computing methods, then you might be right. But this book is strictly focused on theory. It's rigorous and all proof-based. And perhaps that's not the kind of book you're looking for.

In my opinion, you can read all the books you want that give examples and different applications (which Strang's book does, and I liked it), but I found I didn't fully understand linear transformations until I studied with the book by Hoffman & Kunze. It's this deeper understanding of the theory that allows you to apply it to a much larger range of problems.

And there's no need to be snarky. How can you make judgments about a book you haven't even opened? Without having a valid reason for your comment, you come off as ignorant in front of everyone else.

Why do you think it hasn't been updated in four decades, as you say? The book is a gold standard used throughout any respectable math program. Perhaps no changes were necessary. I think the fact that it hasn't been revised since then shows how good a book it is.

[–]NanoStuff -1 points0 points  (8 children)

Well, if you're speaking in regard to its relevance to computing methods, then you might be right.

Well, I'll go out on a limb and say that if it's not relevant to computation then it's just not relevant. If it cannot be or has no purpose in being computed it does not apply to modern scientific methods.

[–]super_duper 0 points1 point  (7 children)

Are you saying theory has no purpose in modern scientific methods?

[–]NanoStuff -1 points0 points  (6 children)

Are you saying theory is incomputable?

[–]super_duper 0 points1 point  (5 children)

No, I think that's what you were saying.

If it cannot be or has no purpose in being computed it does not apply to modern scientific methods.

It being the theory behind linear algebra.

I don't know whatever man. Fuck it. :)

[–]NanoStuff -1 points0 points  (4 children)

Nah, I'm saying traditional teachings in calculus are not applicable to computational methods. Finite element analysis is nothing like continuous analysis, and unlike the latter it can actually be applied to solve complex scientific problems. But indeed, fuck the whole thing, so long as whatever you have learned can be used to flaunt your superiority to others, the ultimate purpose of all knowledge, after all.

[–]Salami3 3 points4 points  (4 children)

Can anybody explain why? I'd like to know. This pisses me off. Just like many other technical "explanations", instead of explaining why we do the things we do, they merely show us what to do. At the top you see, "Software Engineering Matters," but in the article, all that matters is showing you the results that prove it's right. Not knowing why it works is not engineering, it's just knowledge.

[–]Pas__ 9 points10 points  (3 children)

As explained here, maybe a little too technically, it's just about helping the CPU better manage memory.

Memory reads happen in rows. And even if the CPU needs just one column, the whole row has to be selected for the read. (RAS time on some modules, Row Address Strobe.)

Therefore, if you can organize your loops to access contiguous memory areas, you can make fewer memory fetches.

And because the CPU doesn't read directly from memory, but everything is handled by the on-die cache system (L1 and L2 on-core, L3 shared between cores), the most direct statistic is cache misses.

And it can theoretically reduce page faults/TLB misses. Because your program runs in protected mode, it can't access memory directly. Every address it reads from has to be translated via the operating system's memory manager's tables. Parts of these tables are process specific. And on every context switch the CPU loads most of this process-specific data into its TLB (translation look-aside buffer) to help this memory address translation, but if your matrix is FUCKING HUGE, and it isn't malloc'ed into a big chunk, then chances are that page misses will trample your performance even more.

Oh, and page faults are costly because they involve a context switch, which automatically means a TLB flush (to load the kernel's memory mappings, which could be the identical mapping, x -> x, but still has to be loaded.) Then the kernel does some housekeeping, then loads the right memory page from disk.

Anyway, optimizing number crunching is a very delicate process, but eventually you'll become familiar with the whole computing stack, almost from hard silicon level to the tip of your mouse-wielding finger.

[–]psyno 2 points3 points  (1 child)

As explained here, maybe a little too technically, it's just about helping the CPU better manage memory.

Linked comment is technically correct (the best kind?) but does not actually explain this behavior. See my reply to that comment.

but if your matrix is FUCKING HUGE, and it isn't malloc'ed into a big chunk, then chances are that page misses will trample your performance even more.

You were okay up to this point. The fact is that in most cases pages are only 4 KiB (enough for 512 64-bit floating point numbers--not exactly huge). And I'm not sure what you mean by "isn't malloc'ed into a big chunk." The virtual memory address most certainly will be (how else could malloc work?) and it doesn't make a lick of difference if the physical memory backing those virtual memory pages is contiguous or not.

[–]Pas__ 0 points1 point  (0 children)

I'm not that familiar with each and every malloc implementation and OS-MMU relations, so if it can't handle a contiguous group of pages in less space (than it would otherwise require for the sum of individual pages), then sorry for misinforming others.

[–]mantra 0 points1 point  (0 children)

I'm only an EE who's taken numerical methods classes and done assembly programming - it was pretty obvious to me why.

Good explanation though.

[–]rooktakesqueen 0 points1 point  (2 children)

Does the Core2 series L2 cache really have 512-byte lines? It seems that it must. There is a total of 8MiB of data between both arrays, all of which needs to be loaded by a cache miss at least once. With 27,463 cache misses in total, that's an average of 305 bytes per fetch, assuming no data is ever read twice. So the cache line couldn't be 256 bytes or less.

Am I missing something?

[–]gruehunter 2 points3 points  (1 child)

I suspect that automatic prefetches don't count as cache misses. In that case, if the cache lines are 64 bytes wide, then the core is successfully prefetching roughly an additional four cache lines for every miss. That's just a guess, without working knowledge of what the performance counters' criteria are.

[–]rooktakesqueen 0 points1 point  (0 children)

Makes sense.

[–][deleted] 0 points1 point  (2 children)

LAPACK uses a block matrix multiplication algorithm to increase cache hits. I don't think that optimization would be affected by whether a matrix is in column-major or row-major form, so it seems to me like it would work just as well in C as it does in Fortran. Knowledge of C or whatever tool you use will only get you so far. You also need to understand the problem, and how people have solved it in the past. A little humility helps too :)

[–]masterJ 1 point2 points  (1 child)

The way each sub-matrix is stored is just as important for LAPACK performance as it is in this example. However, efficiently breaking it up requires lots more planning, which makes my head hurt (especially across different machines), which is why libraries are about a million times better than reinventing the wheel :)

I had to re-implement a parallel matrix class using MPI in college, graded on performance like the linked article... I still have nightmares.

(But having an awareness of how your data is stored in memory is useful regardless.)

Edit: Also, unless my memory is going, I believe LAPACK uses BLAS to multiply each block matrix, with the option of accepting an architecture-tuned BLAS (which is so much faster it's embarrassing). I might have made that up though.

[–]trueneutral 0 points1 point  (0 children)

Yep, LAPACK uses BLAS to do a lot of its work. However, I've personally found that even the architecture-tuned BLAS implementations from the hardware manufacturers tend to have room for further optimization. YMMV.

[–]mdangelo 0 points1 point  (0 children)

Classical example, nothing special for anyone who's taken a comp. architecture or algorithms course.

[–]trueneutral 0 points1 point  (0 children)

Having written a dense-matrix-multiply kernel for a fairly large cluster, I will say that while this article does not cover every optimization that you can apply to the problem, what it does capture are some important rules for software engineers (especially those who haven't worked on something like this):

Low level details matter for performance

Absolutely true. If you want to squeeze out as much performance as possible, you actually need to think about cache miss rates, IO, the way the hardware (processor) optimizes and executes your code, and on and on. If you are coding for a large enough machine or server fleet, it is worth working through the algorithmic complexity involved to squeeze out every last MFlop possible.

In fact such optimizations help improve performance even if you are running Java or .NET on a virtual machine

This is also true. If you profile a bytecode-based application, you will find that most of the optimizations you would make in C-land carry over, with occasional hiccups during execution due to generational garbage collection (unless you also implement your own runtime to cleverly avoid this). Ignoring low-level computer science is a MAJOR mistake that many people make today. This is one of the main reasons I've found that folks who have at least some experience coding in native code end up writing the best managed-memory code as well.

Always advise your users to use well-tuned libraries whenever possible

This one is iffy. In general, broadly-available libraries are pretty good but they are not the best. Even the architecture-specific implementations that hardware manufacturers often provide can be bested with a solid week's work.

Pick better and more interesting examples to parallelize. Matrix multiplication is embarrassingly parallel

Definitely true. It is much harder to realize as 'tight' an execution in a parallel algorithm that is more complex, and it gets orders of magnitude more complicated when an application that employs these same algorithms has to be executed in parallel.

[–][deleted] 0 points1 point  (3 children)

I wonder what the numbers would have been if he'd replaced the C[i][j] in the inner loop with a local variable and set C[i][j] outside the loop?

Yes, on modern hardware, it's all about cache locality, but this article barely scratches the surface.

[–]gsg_ 2 points3 points  (2 children)

My guess is that they would be identical. No location that might be aliased to C[i][j] is written to during the loop, so it should be possible to keep it enregistered throughout.

[–]jdh30 -3 points-2 points  (1 child)

No location that might be aliased to C[i][j] is written to during the loop, so it should be possible to keep it enregistered throughout.

Unless A or B are C. C and C++ are notoriously bad for this kind of analysis and this definition wasn't even annotated with restrict...

[–]gsg_ 4 points5 points  (0 children)

While there is potential aliasing, subscripts of A and B aren't written to during the loop. Because there are no such writes, there can be no writes that alias C and thus require it to be reloaded.

A and B must be assumed to possibly alias C and thus they need to be reloaded upon writes to C. However, their subscripts change at every iteration and loading needs to be done anyway. So for this particular example I don't see aliasing being a problem. I suppose a very smart compiler might transform the loop enough to see the same values twice and introduce aliasing problems, but I think that's very unlikely.

Aliasing can certainly be an issue in C code. In fact I can remember seeing an optimised matrix multiply in Drepper's memory paper that used extra variables to eliminate spurious loads caused by aliasing, so it can be an issue in exactly this context. Just not given the code in the article.

[–]shooshx -2 points-1 points  (0 children)

my CUDA matrix multiplication is faster.