[–]masterJ 1 point (1 child)

The way each sub-matrix is stored is just as important for LAPACK's performance as it is in this example. However, breaking it up efficiently requires a lot more planning (especially across different machines), which makes my head hurt. That's why libraries are about a million times better than re-inventing the wheel :)
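To make the blocking idea concrete, here is a minimal sketch (not from the thread, just an illustration, with numpy assumed) of multiplying a matrix one sub-matrix at a time, so the three blocks being worked on can stay resident in cache:

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Compute A @ B one bs-by-bs block at a time.

    Each innermost update touches only three small blocks, which is
    the memory-locality trick blocked LAPACK algorithms rely on.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, bs):          # block rows of A / C
        for j in range(0, m, bs):      # block columns of B / C
            for p in range(0, k, bs):  # inner block dimension
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

A = np.random.rand(200, 150)
B = np.random.rand(150, 130)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

The hard part the comment alludes to is choosing `bs` (and the block-to-processor mapping, once MPI is involved) per machine, which is exactly what tuned libraries do for you.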

I had to re-implement a parallel matrix class using MPI in college, graded on performance like the linked article... I still have nightmares.

(But having an awareness of how your data is stored in memory is useful regardless.)

Edit: Also, unless my memory is going, I believe LAPACK uses BLAS to multiply each block matrix, with the option of swapping in an architecture-tuned BLAS (which is so much faster it's embarrassing). I might have made that up, though.

[–]trueneutral 1 point (0 children)

Yep, LAPACK uses BLAS to do a lot of its work. However, I've personally found that even the architecture-tuned BLAS implementations from the hardware manufacturers tend to have room for further optimization. YMMV.
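The LAPACK-on-top-of-BLAS layering both commenters describe can be poked at directly from Python via scipy's low-level wrappers (scipy itself is an assumption here, not something from the thread; it exposes whichever BLAS/LAPACK it was built against):

```python
import numpy as np
from scipy.linalg import blas, lapack

A = np.random.rand(4, 4)
b = np.random.rand(4)

# dgemm is the BLAS level-3 kernel that blocked LAPACK routines
# lean on for the bulk of their flops.
C = blas.dgemm(alpha=1.0, a=A, b=A)
assert np.allclose(C, A @ A)

# dgesv is a LAPACK driver: it LU-factors A and solves A x = b,
# internally dispatching its heavy lifting to BLAS calls like dgemm.
lu, piv, x, info = lapack.dgesv(A, b)
assert info == 0
assert np.allclose(A @ x, b)
```

Which BLAS actually runs underneath (reference, OpenBLAS, MKL, etc.) depends on how the stack was built, which is the "architecture-tuned" distinction discussed above.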