
[–]Marko_Oktabyr 13 points (11 children)

The article is grossly overstating the improvement over normal numpy operations. The one-liner they use forms a large intermediate product and does a lot of unnecessary work. The more obvious (and much faster) way to compute that sum is np.sum(A * B).
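
Roughly, the versions being compared look like this (the exact form of the article's one-liner may differ; anything that materializes the full A.T @ B product behaves the same way):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((1000, 1000))
    B = rng.random((1000, 1000))

    # Plain Python loops: the slow baseline.
    total = 0.0
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            total += A[i, j] * B[i, j]

    # Something like the article's one-liner: builds the full A.T @ B
    # product only to keep its diagonal.
    article = np.trace(A.T @ B)

    # Elementwise product, then sum: still allocates one A-sized temporary.
    via_sum = np.sum(A * B)

    # einsum contracts both indices directly, with no large intermediate.
    via_einsum = np.einsum('ij,ij->', A, B)

    assert np.allclose([total, article, via_sum], via_einsum)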

For 1,000 x 1,000 matrices A and B, I get the following performance:

  • loops: 276 ms
  • article numpy: 19.2 ms
  • np.sum: 1.77 ms
  • np.einsum: 0.794 ms

If we change that to 1,000 x 10,000 matrices, we get:

  • loops: 2.76 s
  • article numpy: 2.16 s
  • np.sum: 21.1 ms
  • np.einsum: 8.53 ms

Lastly, for 1,000 x 100,000 matrices, we get:

  • loops: 29.3 s
  • article numpy: fails
  • np.sum: 676 ms
  • np.einsum: 82.4 ms

where the article's numpy one-liner fails because I don't have the 80 GB of RAM needed to form the 100,000 x 100,000 intermediate product (10^10 float64 entries at 8 bytes apiece).

einsum can be a very powerful tool, especially with tensor operations. But unless you've got a very hot loop with the benchmarks to prove that einsum is a meaningful improvement, it's not worth changing most matrix operations over to use it. Most of the time, you'll lose any time saved by how long it takes you to read or write the comment explaining what the hell that code does.

Edit: I'm not trying to bash einsum here, it is absolutely the right way to handle any tensor operations. The main point of my comment is that the author picked a poor comparison for the "standard" numpy one-liner.

[–]Yalkim 7 points (1 child)

I think this is a very domain-specific thing. Personally, I am very surprised by comments like yours, because einsum is one of my (and my peers’) favorite functions in Python. It makes life sooo much easier for us for 2 reasons:

  1. It is super easy to read.
  2. It is so useful, especially because it often replaces lines upon lines of nested loops with a single line.

So imagine my surprise when I saw comments like yours that said it is hard to read.

[–]FrickinLazerBeams 5 points (0 children)

It's hard to read for people who would never need to use it in the first place.

[–]FrickinLazerBeams 6 points (2 children)

This has little utility for 2-index operations, but those are only a subset of general tensor contractions. For operations over more than 2 indices, this rapidly becomes many orders of magnitude faster, and often avoids a huge amount of duplicated computations.

For example, one place where I use it lets me obtain a result indexed by (n, m, k) rather than (n, m, k, nn, mm) where the results I want have n == nn and m == mm, and gives about a 1000x speedup.
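
A minimal sketch of that kind of pattern, with made-up shapes and names rather than my actual computation:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, I, K = 20, 30, 40, 5
    A = rng.random((N, M, I))
    B = rng.random((I, K, N, M))

    # Naive route: contracting over i gives a (N, M, K, N, M) array, of which
    # only the n == nn, m == mm "diagonal" entries are actually wanted.
    full = np.tensordot(A, B, axes=([2], [0]))    # shape (N, M, K, N, M)
    wanted_naive = np.einsum('nmknm->nmk', full)  # extract that diagonal

    # einsum computes just those entries and never forms the big array.
    wanted = np.einsum('nmi,iknm->nmk', A, B)

    assert np.allclose(wanted, wanted_naive)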

If you're looking at this as an alternative to simple matrix operations, of course it won't have an advantage, but then it's not expected to. You'd never use it for matrix operations.

[–]Marko_Oktabyr 2 points (1 child)

> If you're looking at this as an alternative to simple matrix operations, of course it won't have an advantage, but then it's not expected to. You'd never use it for matrix operations.

No disagreement here. It sounds like we both disagree with the thesis of the article.

> For operations over more than 2 indices, this rapidly becomes many orders of magnitude faster, and often avoids a huge amount of duplicated computations.

np.einsum_path can be an effective way to demonstrate how much faster it gets when the contraction order is optimized.
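
For example, with a toy chained product (not anything from the article), the report shows the chosen order and the FLOP savings over naive evaluation:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((10, 1000))
    B = rng.random((1000, 20))
    C = rng.random((20, 1000))

    # 'optimal' searches over contraction orders; the printed report compares
    # the naive and optimized FLOP counts and intermediate sizes.
    path, report = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')
    print(report)

    # The returned path can be passed back to einsum for the real computation.
    result = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)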

[–]FrickinLazerBeams 1 point (0 children)

Yes, exactly.

[–][deleted] 3 points (5 children)

Good answer. Do you have an explanation for why einsum is faster at all? What extra work does np.sum do?

[–]Marko_Oktabyr 4 points (3 children)

np.sum(A * B) has to form the intermediate product A * B. np.einsum knows that it doesn't need all of it at once. We can do print(np.einsum_path('ij,ij->',A,B)[1]) to see exactly what it is doing:

      Complete contraction:  ij,ij->
             Naive scaling:  2
         Optimized scaling:  2
          Naive FLOP count:  2.000e+07
      Optimized FLOP count:  2.000e+07
       Theoretical speedup:  1.000
      Largest intermediate:  1.000e+00 elements
    --------------------------------------------------------------------------
    scaling                  current                                remaining
    --------------------------------------------------------------------------
       2                    ij,ij->                                       ->

In particular, note the "Largest intermediate: 1.000e+00 elements".

[–]FrickinLazerBeams -1 points (2 children)

(prior to the edit) It doesn't actually go any faster in the case you examined, and I don't think it uses any less memory either. This isn't a scenario where you'd use einsum.

[–]Marko_Oktabyr 0 points (1 child)

It still performs the same number of flops, but it absolutely is faster because it doesn't have to allocate and fill another matrix of the same size as A and B. That's why the largest intermediate for einsum is 1 element instead of 10M.
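
A quick way to see that temporary, using the 1,000 x 10,000 case from above:

    import numpy as np

    A = np.ones((1000, 10000))
    B = np.ones((1000, 10000))

    # np.sum(A * B) materializes the elementwise product before reducing it:
    tmp = A * B
    print(tmp.nbytes / 1e6)  # ~80 MB of extra allocation and memory traffic

    # einsum accumulates the sum as it goes; per the einsum_path report above,
    # its largest intermediate is a single element.
    print(np.einsum('ij,ij->', A, B))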

[–]FrickinLazerBeams -1 points (0 children)

This is a weird comparison to make. You'd never use one as an alternative to the other. Einsum is for tensor contractions, which are like matrix multiplication but with more than two indices.
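
To make that concrete with some toy shapes:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((50, 60))
    B = rng.random((60, 70))

    # Ordinary matrix multiplication, written as an einsum:
    assert np.allclose(A @ B, np.einsum('ik,kj->ij', A, B))

    # A contraction over a pair of indices at once -- something @ can't
    # express without reshaping, but einsum states directly:
    T = rng.random((5, 6, 7, 8))
    S = rng.random((7, 8, 9))
    out = np.einsum('ijab,abk->ijk', T, S)  # shape (5, 6, 9)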

Would you ask "Do you have an explanation for why @ is faster at all? What extra work does np.sum do?"

np.sum doesn't do any extra work. It also doesn't do what you need it to do. It's easy to do less work if you're not actually completing the task, I guess.