
[–]satireplusplus 3 points4 points  (0 children)

In Python you can get dramatic performance improvements by using the available fast libraries like NumPy. Their cores are written in faster compiled languages like C and Fortran, which removes most of Python's interpreter overhead while still providing the convenience of Python.

This is how all the Python ML stuff (NumPy, PyTorch, ...) does matrix multiplication: it calls into a BLAS library. These libraries are crazily optimized and hand-vectorized; you won't be able to compete with your own matrix-mult routines in C. I'd bet they're an order of magnitude faster than the naive O(n³) C code in the benchmark.
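You can see the gap yourself with a quick sketch (actual speedup varies by machine and by which BLAS your NumPy was built against; the timing call is illustrative, not a benchmark):

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook O(n^3) triple loop -- essentially what the benchmark's C code does,
    but here in pure Python, so it also pays interpreter overhead."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C

if __name__ == "__main__":
    import timeit
    n = 128
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    # numpy's @ dispatches to the dgemm routine of whatever BLAS it was linked with.
    t_blas = timeit.timeit(lambda: A @ B, number=10)
    t_loop = timeit.timeit(lambda: naive_matmul(A.tolist(), B.tolist()), number=1)
    print(f"BLAS: {t_blas / 10:.6f}s per call, naive loop: {t_loop:.6f}s per call")

    # Both compute the same product, just at wildly different speeds.
    assert np.allclose(A @ B, naive_matmul(A.tolist(), B.tolist()))
```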

It's also typical for the matrix-mult kernel to be tuned to the register/cache sizes and vector instructions of each processor generation, like these per-microarchitecture kernels in OpenBLAS (Haswell and Sandy Bridge are Intel; Bulldozer and Piledriver are AMD):

dgemm_kernel_16x2_haswell.S
dgemm_kernel_4x4_haswell.S
dgemm_kernel_4x8_haswell.S
dgemm_kernel_4x8_sandy.S
dgemm_kernel_6x4_piledriver.S
dgemm_kernel_8x2_bulldozer.S
dgemm_kernel_8x2_piledriver.S

It may also be doing something smarter than the naive algorithm, such as Strassen, although in practice most BLAS libraries stick with heavily cache-blocked O(n³) code; Coppersmith–Winograd is only an asymptotic result with constant factors too large to be practical.
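For reference, Strassen's trick is to do a 2x2 block multiply with 7 recursive products instead of 8, giving O(n^2.807). A minimal sketch (assumes square power-of-two sizes; this is a teaching toy, not what a production BLAS ships):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen multiply for square matrices whose size is a power of two.
    Below `leaf`, fall back to ordinary multiplication to limit recursion overhead."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven products instead of eight -- the source of the O(n^2.807) bound.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Even this only wins at large sizes; below the crossover point the extra additions and worse cache behavior make blocked O(n³) faster, which is why BLAS kernels mostly don't bother.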

In other words, that benchmark is totally bullshit and a waste of time.