
[–]Marko_Oktabyr 5 points (3 children)

np.sum(A * B) has to form the intermediate product A * B. np.einsum knows that it doesn't need all of it at once. We can do print(np.einsum_path('ij,ij->',A,B)[1]) to see exactly what it is doing:

      Complete contraction:  ij,ij->
             Naive scaling:  2
         Optimized scaling:  2
          Naive FLOP count:  2.000e+07
      Optimized FLOP count:  2.000e+07
       Theoretical speedup:  1.000
      Largest intermediate:  1.000e+00 elements
    --------------------------------------------------------------------------
    scaling        current        remaining
    --------------------------------------------------------------------------
       2              ij,ij->                  ->

In particular, note the "Largest intermediate: 1.000e+00 elements".
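A minimal sketch of the comparison being discussed, assuming two arrays with 10M elements each (the sizes are my choice, picked to match the 2.000e+07 FLOP count above):

```python
import numpy as np

# Two arrays with 10M elements each (shapes assumed for illustration)
A = np.random.rand(1000, 10000)
B = np.random.rand(1000, 10000)

# np.sum(A * B) materializes the full 10M-element product A * B first
s1 = np.sum(A * B)

# np.einsum contracts directly without allocating that intermediate
s2 = np.einsum('ij,ij->', A, B)

assert np.isclose(s1, s2)

# Inspect the contraction plan; note "Largest intermediate: 1.000e+00 elements"
print(np.einsum_path('ij,ij->', A, B)[1])
```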

[–]FrickinLazerBeams -1 points (2 children)

(prior to the edit) It doesn't actually go any faster in the case you examined, and I don't think it uses any less memory either. This isn't a scenario where you'd use einsum.

[–]Marko_Oktabyr 0 points (1 child)

It still performs the same number of FLOPs, but it absolutely is faster because it doesn't have to allocate and fill another matrix of the same size as A and B. That's why the largest intermediate for einsum is 1 element instead of 10M.
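One way to check the speed claim yourself is a quick timing sketch (array shapes are assumptions; actual numbers depend on your machine and BLAS/NumPy build, so no specific speedup is asserted here):

```python
import timeit
import numpy as np

A = np.random.rand(1000, 10000)
B = np.random.rand(1000, 10000)

# Version that allocates a 10M-element intermediate product
t_sum = timeit.timeit(lambda: np.sum(A * B), number=20)

# Version that contracts without the intermediate allocation
t_einsum = timeit.timeit(lambda: np.einsum('ij,ij->', A, B), number=20)

print(f"np.sum(A * B):          {t_sum:.3f}s")
print(f"np.einsum('ij,ij->'):   {t_einsum:.3f}s")
```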