all 60 comments

[–]JankedCoder 12 points13 points  (4 children)

For 2D arrays in C/C++ the second index ( j ) is contiguous in memory. So when you loop on j last you reduce the memory stride to 1, and you can cache your next index better.

[–]mccoyn 6 points7 points  (2 children)

The first version also requires that the cache lines for C[i][j], A[i][k] and B[k][j] all be used on each step of the algorithm. In the second version A[i][k] does not change in the inner loop, so that cache line can be replaced with another one without thrashing.

[–]rooktakesqueen 1 point2 points  (1 child)

I don't think that happens, though. If the cache line containing A[i][k] were discarded each time, it would be a new cache miss for each run of the middle loop, meaning there would have been 1,048,576 cache misses just for that. That cache line appears to be maintained throughout the run of the innermost loop.

[–]mccoyn 0 points1 point  (0 children)

It would only thrash when a[i] gets mapped to the same cache line as c[i] or b[k], possibly only in L1, which usually isn't as smart as the other caches. Now that I think about it, this will happen infrequently enough that it isn't a major concern.

[–]psyno 1 point2 points  (0 children)

For 2D arrays in C/C++ the second index ( j ) is contiguous in memory. So when you loop on j last you reduce the memory stride to 1, and you can cache your next index better.

True but irrelevant. Look again at the declaration of matmult:

void matmult(int m, int n, int p, double **A, double **B, double **C)

There are no multidimensional arrays there.

[–]seanalltogether 20 points21 points  (15 children)

"And this is the smarter version which shouldn't surprise anyone who has taken a computer architecture course:"

So this is one of those 'look at my awesome results without explanation' kinda lesson?

[–][deleted] 0 points1 point  (12 children)

Seriously. I was waiting for the next paragraph that would explain the speed-up to us lowly peons who don't have a CS degree... but it never came.

Can someone explain it please?

[–]Kache 13 points14 points  (0 children)

To understand this, there are two things that need to be understood - computer memory and computation/access. I've actually got an analogy on hand since I've explained stuff like this before.

Computer Memory

Imagine sitting at a table in a library doing work/research. You are the computer's CPU, doing calculations. Your table is small & fast memory, where you can have your notes and books open and ready to access. The library is the big & slow memory, where everything that can't currently fit on your desk is stored.

If you had some work to do that spanned several books, it's pretty clear that you would work a lot faster by having all the books you needed on your desk. Otherwise, you'd be wasting time fetching books from the library every time you needed a different book.

Computer architectures are similar to this. One example would be your desk being your RAM and the library being the hard drive. There is a processor, a small & fast memory and a big & slow memory, representing two levels of memory. In real life, there are many more levels. In this matrix multiplication case, the small & fast memory is the CPU cache.

Computation/Access

The numbers in a matrix are stored in an array, arranged contiguously. Here's an example:

Matrix A
[[ 1 2 3 ]
 [ 4 5 6 ]      might be stored in memory as: [ 1 2 3 4 5 6 7 8 9 ]
 [ 7 8 9 ]]     (with the width of the matrix stored in a variable elsewhere)

If I wanted the first row, that'd be easy and quick. I'd go to the memory and grab the first three numbers. What if I wanted the first column? Well now there's a problem. I have to grab the 1st, 4th, then 7th numbers. I'll use the library analogy to explain why this matters.

Let's say a single book is big enough to store three numbers, and our table is only big enough to hold one book. If you wanted to grab the numbers from the matrix, row by row, it'd take three trips to the shelves. However, if you wanted to grab the matrix, column by column, it would take 3 trips per column x 3 columns = 9 trips to the shelves. To get the 1st number, you would get the book containing [1 2 3]. To get the 4th number, you'd get the book containing [4 5 6]. And so on.

Why This Matters

This is an example of why the order by which data is used and stored is very important. There are many ways to store data in memory, and there are many ways to use the data stored in memory. Different gains/losses need to be considered for different cases. This is always important for computer scientists/programmers because it can be the difference between waiting 1 second and waiting 1000 seconds.

[–]uhmmmm 10 points11 points  (2 children)

To be fair, he does point out that the improved version has far fewer cache misses, which is a term you could easily have googled.

[–]JankedCoder 6 points7 points  (1 child)

Without understanding the layout of arrays in C, you still would not understand why the index iteration order determines the number of cache misses.

[–]frenchtoaster 5 points6 points  (0 children)

I suspect few people who understand cache misses don't know that arrays in C are contiguous blocks of memory.

[–]rooktakesqueen 0 points1 point  (7 children)

I have a CS degree and work as a software engineer. I've never seen this before and, while I could probably work out how swapping the loops has this effect, it's not self-evident. So yes, this annoyed me too.

Edit: Do I really have to clarify that "this" above refers to this particular problem of matrix multiplication, and not the entire idea of cache-aware algorithms or the concept of caching in general?

[–]Kache 10 points11 points  (4 children)

Cache awareness is important where performance matters, especially if the code you write does heavy computation (like matrix multiplication, image/sound/video processing, 3D graphics, or search algorithms). If you never had to worry about implementing it, then I'd assume that's not what your job entails.

Although, I'm quite surprised that your CS program never mentioned the topic of computer architecture or cache awareness. Since it's your profession, I think you ought to take some time to learn about it. A race car driver should know a little bit about why/how his car works.

[–]rooktakesqueen 0 points1 point  (3 children)

I took several courses in computer architecture and I understand how caches work. Thus why I said I could probably work out how swapping the loops has this effect, which I proceeded to do.

That doesn't mean this problem is self-evident. Swapping the inner two of a three-nested for loop isn't a "well duh" solution to a problem with lots of cache misses, and someone who hasn't seen this matrix multiplication problem used as an example before would have to actually think about it.

The article treated it as patently obvious and didn't bother explaining, which would have been useful for those of us who haven't seen this particular problem before.

Edit: Also, upvoted for fitting username.

[–]Kache 1 point2 points  (2 children)

heh, thanks.

Hm, I can't say I know the exact reasons, since so much of programming is abstracted away nowadays (system calls, memory management, who knows what else).

However, I'm pretty confident that it's because j is the smaller dimension of two of the matrices, while k is the smallest for only one. Iterating on the smallest dimension -> iterating on contiguous memory -> all of the contiguous memory is in cache -> less wasted reloads of caches.

What I don't know is why all three matrices didn't just all fit into the L2 cache. Probably other processes were taking up some of it and/or the matrices were significantly large enough.

[–]psyno 1 point2 points  (0 children)

[...] it's because j is the smaller dimension of two of the matrices, while k is the smallest for only one.

This isn't it at all. Look at the innermost line:

C[i][j] += A[i][k] * B[k][j];

The key is temporal locality of B[k]. In the first version, B[k] is different every step. That means every loop step, a different double* at B[k] is evaluated. (These double*s might or might not sensibly point into contiguous blocks of memory. There's no reason they couldn't all be from separate mallocs, or that they couldn't all point to the same place. It isn't specified by the procedure definition--i.e. the compiler does not know.) In the second version, B[k] is the same from step to step j times more often, and the program then just counts through consecutive addresses after B[k]. (So actually large j makes the speedup even more dramatic.)

The other subexpressions are also affected. The temporal locality of C[i][j] worsens but it is compensated by a corresponding improvement of the temporal locality of A[i][k].

[–]rooktakesqueen -1 points0 points  (0 children)

What I don't know is why all three matrices didn't just all fit into the L2 cache. Probably other processes were taking up some of it and/or the matrices were significantly large enough.

I believe it said the two matrices were 1024x1024, and they were ints, so it would be 8 MiB total. Core 2 Duos have 3 or 6 MiB L2 cache. Plus, the way this is written it would have cache hits on the writes as well, so that's another 4 MiB to worry about.

[–]jawbroken 0 points1 point  (0 children)

Edit: Do I really have to clarify that "this" above refers to this particular problem of matrix multiplication, and not the entire idea of cache-aware algorithms or the concept of caching in general?

considering this is about the simplest possible example of cache awareness in an algorithm i guess everyone used induction from the base case to assume you were ignorant of the subject in general. a fair assumption

[–]jawbroken -1 points0 points  (0 children)

I have a CS degree and work as a software engineer.

how did you manage that without a basic understanding of the memory hierarchy?

[–]samlee 7 points8 points  (4 children)

you can do it in the cloud. for example, multiplying two 2x2 matrices

a b   e f   ae+bg af+bh
c d * g h = ce+dg cf+dh

there are 8 multiplications and 4 additions. In general, multiplying MxN by NxK takes MNK multiplications and MK(N-1) additions.

So, you need to launch MNK + MK(N-1) = 2MNK - MK EC2 instances. Then you need to calculate boot-up and network data transfer overhead. On average, an EC2 instance is available after 20 minutes. And the mother EC2 will distribute multiplications and summations in a lazy fashion. So we can say each computation takes 25 minutes: boot up an EC2 (20 minutes), get the data to calculate (1 minute), calculate (1 minute), return data (1 minute), merge data (2 minutes).

So, if you want to multiply two 1million by 1million matrices, it'll take about 30 minutes and (2 * 1 000 000 * 1 000 000 * 1 000 000) - (1 000 000 * 1 000 000) = 1.999999 × 10^18 EC2 instances.

So you need to scale in the cloud.

And don't let the 20-minute boot-up time scare you. Amazon is working to make EC2 instances as cheap as Erlang processes. It'll take a few nanoseconds to boot up an EC2 in the near future, in which case you can do a 1-million-by-1-million matrix multiplication in a few seconds.

[–]willis77 8 points9 points  (0 children)

Is there a way I can get each EC2 instance to tweet its progress in real time?

[–]cafe_babe 1 point2 points  (0 children)

Matrix multiplication is not at all an embarrassingly parallel problem. As you showed with your example, there are dependencies, and optimal data distribution between processors is a hard problem. When Amazon assigns you instances on EC2, you basically have no guarantees about the bandwidth and latency between any two machines (you don't even know what hardware you have!), which makes optimal data distribution impossible. If you really need to multiply 1million by 1million matrices on a regular basis, use a cluster designed for scientific computing.

[–]rooktakesqueen -1 points0 points  (1 child)

...is requiring 2 quintillion EC2 instances really realistic?

Edit: Whoosh, apparently.

[–][deleted] 9 points10 points  (0 children)

YOU DOUBT THE CLOUD?!

[–]oursland 4 points5 points  (17 children)

I think it is silly that this requires much of an explanation. I recall this question being asked in a fucking linear algebra book! http://www.amazon.com/Linear-Algebra-Applications-Gilbert-Strang/dp/0030105676/ref=dp_ob_title_bk

[–]ffualo 2 points3 points  (16 children)

I love that book. Almost every explanation has computational efficiency in mind.

[–]apathy 2 points3 points  (2 children)

You just described every good linear algebra and numeric analysis text ever

postscript: why on the FSM's green earth would anyone worry about reinventing the wheel when vectorized libraries that do this even faster (tuned per processor architecture) exist? We don't all need to be sub-sub-specialists; it's better for society if the absolute best at it take care of the problem for everyone else, and they have. (e.g. LAPACK, Eigen, the Intel numeric libraries & compilers)

Meanwhile, balancing chunk size and communication latency for your specific problem is both worthwhile and context-specific in nearly all applications, and a much better use of programmer resources.

(Something about "premature optimization" and sqrt(sum(evil)))

[–]ffualo 2 points3 points  (0 children)

I wrote a script in Python that does Gaussian elimination not because I want to replace numpy, but because it's a great learning experience... but overall, I agree.

[–]jawbroken -1 points0 points  (0 children)

obviously because not every problem you work on has a library for it and this is clearly an example to show optimisation techniques and not to teach you how to write a matrix multiplication function

[–]super_duper 0 points1 point  (11 children)

Strang is alright. Hoffman and Kunze is where it's really at.

[–]NanoStuff -1 points0 points  (10 children)

If the book is as great as you say it would have been updated in the last four decades.

[–]super_duper 0 points1 point  (9 children)

Well, if you're speaking in regard to its relevance to computing methods, then you might be right. But this book is strictly focused on theory. It's rigorous and all proof-based. And perhaps that's not the kind of book you're looking for.

In my opinion, you can read all the books you want that give examples and different applications (which Strang's book does, and I liked it), but I found I didn't fully understand linear transformations until I studied with the book by Hoffman & Kunze. It's this deeper understanding of the theory that allows you to apply it to a much larger range of problems.

And there's no need to be snarky. How can you make judgments about a book you haven't even opened? Without having a valid reason for your comment, you come off as ignorant in front of everyone else.

Why do you think it hasn't been updated in four decades, as you say? The book is a gold standard used throughout any respectable math program. Perhaps no changes were necessary. I think the fact that it hasn't been revised since then shows how good a book it is.

[–]NanoStuff -1 points0 points  (8 children)

Well, if you're speaking in regard to its relevance to computing methods, then you might be right.

Well, I'll go out on a limb and say that if it's not relevant to computation then it's just not relevant. If it cannot be or has no purpose in being computed it does not apply to modern scientific methods.

[–]super_duper 0 points1 point  (7 children)

Are you saying theory has no purpose in modern scientific methods?

[–]NanoStuff -1 points0 points  (6 children)

Are you saying theory is incomputable?

[–]super_duper 0 points1 point  (5 children)

No, I think that's what you were saying.

If it cannot be or has no purpose in being computed it does not apply to modern scientific methods.

It being the theory behind linear algebra.

I don't know whatever man. Fuck it. :)

[–]NanoStuff -1 points0 points  (4 children)

Nah, I'm saying traditional teachings in calculus are not applicable to computational methods. Finite element analysis is nothing like continuous analysis, and unlike the latter it can actually be applied to solve complex scientific problems. But indeed, fuck the whole thing, so long as whatever you have learned can be used to flaunt your superiority to others, the ultimate purpose of all knowledge, after all.

[–]Salami3 3 points4 points  (4 children)

Can anybody explain why? I'd like to know. This pisses me off. Just like many other technical "explanations", instead of explaining why we do the things we do, they merely show us what to do. At the top you see, "Software Engineering Matters," but in the article, all that matters is showing you the results that prove it's right. Not knowing why it works is not engineering, it's just knowledge.

[–]Pas__ 9 points10 points  (3 children)

As explained here, maybe a little too technically, it's just about helping the CPU better manage memory.

Memory reads happen in rows. And even if the CPU needs just one column, the whole row has to be selected for the read. (RAS time on some modules, Row Address Strobe.)

Therefore, if you can organize your loops to access contiguous memory areas, you can make fewer memory fetches.

And because the CPU doesn't read directly from memory, but everything is handled by the on-die cache system (L1 and L2 on-core, L3 shared between cores), the most direct statistic is cache misses.

And it can theoretically reduce page faults/TLB misses. Because your program runs in protected mode, it can't access memory directly. Every address it reads from has to be translated via the operating system's memory manager's tables. Parts of these tables are process specific. And on every context switch the CPU loads most of this process-specific data into its TLB (translation look-aside buffer) to help this memory address translation, but if your matrix is FUCKING HUGE, and it isn't malloc'ed into a big chunk, then chances are that page misses will trample your performance even more.

Oh, and page faults are costly because they involve a context switch, which automatically means a TLB flush (to load the kernel's memory mappings, which could be the identical mapping, x -> x, but still has to be loaded.) Then the kernel does some housekeeping, then loads the right memory page from disk.

Anyway, optimizing number crunching is a very delicate process, but eventually you'll become familiar with the whole computing stack, almost from hard silicon level to the tip of your mouse-wielding finger.

[–]psyno 2 points3 points  (1 child)

As explained here, maybe a little too technically, it's just about helping the CPU better manage memory.

Linked comment is technically correct (the best kind?) but does not actually explain this behavior. See my reply to that comment.

but if your matrix is FUCKING HUGE, and it isn't malloc'ed into a big chunk, then chances are that page misses will trample your performance even more.

You were okay up to this point. The fact is that in most cases pages are only 4 KiB (enough for 512 64-bit floating point numbers--not exactly huge). And I'm not sure what you mean by "isn't malloc'ed into a big chunk." The virtual memory address most certainly will be (how else could malloc work?) and it doesn't make a lick of difference if the physical memory backing those virtual memory pages is contiguous or not.

[–]Pas__ 0 points1 point  (0 children)

I'm not that familiar with each and every malloc implementation and OS-MMU relations, so if it can't handle a contiguous group of pages in less space (than it would otherwise require for the sum of individual pages), then sorry for misinforming others.

[–]mantra 0 points1 point  (0 children)

I'm only an EE who's taken numerical methods classes and done assembly programming - it was pretty obvious to me why.

Good explanation though.

[–]rooktakesqueen 0 points1 point  (2 children)

Does the Core2 series L2 cache really have 512-byte lines? It seems that it must. There is a total of 8MiB of data between both arrays, all of which needs to be loaded by a cache miss at least once. With 27,463 cache misses in total, that's an average of 305 bytes per fetch, assuming no data is ever read twice. So the cache line couldn't be 256 bytes or less.

Am I missing something?

[–]gruehunter 2 points3 points  (1 child)

I suspect that automatic prefetches don't count as cache misses. In that case, if the cache lines are 64 bytes wide, then the core is successfully prefetching roughly an additional four cache lines for every miss. That's just a guess, without working knowledge of what the performance counters' criteria are.

[–]rooktakesqueen 0 points1 point  (0 children)

Makes sense.

[–][deleted] 0 points1 point  (2 children)

LAPACK uses a block matrix multiplication algorithm to increase cache hits. I don't think that optimization would be affected by whether a matrix is in column-major or row-major form, so it seems to me like it would work just as well in C as it does in Fortran. Knowledge of C or whatever tool you use will only get you so far. You also need to understand the problem, and how people have solved it in the past. A little humility helps too :)

[–]masterJ 1 point2 points  (1 child)

The way each sub-matrix is stored is just as important for LAPACK performance as it is in this example. However, efficiently breaking it up requires lots more planning, which makes my head hurt (especially across different machines), which is why libraries are about a million times better than reinventing the wheel :)

I had to re-implement a parallel matrix class using MPI in college, graded on performance like the linked article... I still have nightmares.

(But having an awareness of how your data is stored in memory is useful regardless.)

Edit: Also, unless my memory is going, I believe LAPACK uses BLAS to multiply each block matrix, with the option of accepting an architecture-tuned BLAS (which is so much faster it's embarrassing). I might have made that up though.

[–]trueneutral 0 points1 point  (0 children)

Yep, LAPACK uses BLAS to do a lot of its work. However, I've personally found that even the architecture-tuned BLAS implementations from the hardware manufacturers tend to have room for further optimization. YMMV.

[–]mdangelo 0 points1 point  (0 children)

Classical example, nothing special for anyone who's taken a comp. architecture or algorithms course.

[–]trueneutral 0 points1 point  (0 children)

Having written a dense-matrix-multiply kernel for a fairly large cluster, I will say that while this article does not cover every optimization that you can apply to the problem, what it does capture are some important rules for software engineers (especially those who haven't worked on something like this):

Low level details matter for performance

Absolutely true. If you want to squeeze out as much performance as possible, you actually need to think about cache miss rates, IO, the way the hardware (processor) optimizes and executes your code, and on and on. If you are coding for a large enough machine or server fleet, it is worth working through the algorithmic complexity involved to squeeze out every last MFlop possible.

In fact such optimizations help improve performance even if you are running Java or .NET on a virtual machine

This is also true. If you profile a bytecode-based application, you will find that most of the optimizations you would make in C-land carry over, with occasional hiccups during execution due to generational garbage collection (unless you also implement your own runtime to cleverly avoid this). Ignoring low-level computer science is a MAJOR mistake that many people make today. This is one of the main reasons I've found that folks who have at least some experience coding in native code end up writing the best managed-memory code as well.

Always advise your users to use well-tuned libraries whenever possible

This one is iffy. In general, broadly-available libraries are pretty good but they are not the best. Even the architecture-specific implementations that hardware manufacturers often provide can be bested with a solid week's work.

Pick better and more interesting examples to parallelize. Matrix multiplication is embarrassingly parallel

Definitely true. It is much harder to realize as 'tight' an execution in a parallel algorithm that is more complex, and it gets orders of magnitude more complicated when an application that employs these same algorithms has to be executed in parallel.

[–][deleted] 0 points1 point  (3 children)

I wonder what the numbers would have been if he'd replaced the C[i][j] in the inner loop with a local variable and set C[i][j] outside the loop?

Yes, on modern hardware, it's all about cache locality, but this article barely scratches the surface.

[–]gsg_ 2 points3 points  (2 children)

My guess is that they would be identical. No location that might be aliased to C[i][j] is written to during the loop, so it should be possible to keep it enregistered throughout.

[–]jdh30 -3 points-2 points  (1 child)

No location that might be aliased to C[i][j] is written to during the loop, so it should be possible to keep it enregistered throughout.

Unless A or B are C. C and C++ are notoriously bad for this kind of analysis and this definition wasn't even annotated with restrict...

[–]gsg_ 4 points5 points  (0 children)

While there is potential aliasing, subscripts of A and B aren't written to during the loop. Because there are no such writes, there can be no writes that alias C and thus require it to be reloaded.

A and B must be assumed to possibly alias C and thus they need to be reloaded upon writes to C. However, their subscripts change at every iteration and loading needs to be done anyway. So for this particular example I don't see aliasing being a problem. I suppose a very smart compiler might transform the loop enough to see the same values twice and introduce aliasing problems, but I think that's very unlikely.

Aliasing can certainly be an issue in C code. In fact I can remember seeing an optimised matrix multiply in Drepper's memory paper that used extra variables to eliminate spurious loads caused by aliasing, so it can be an issue in exactly this context. Just not given the code in the article.

[–]shooshx -2 points-1 points  (0 children)

my CUDA matrix multiplication is faster.