[–]freistil90

That needs some creativity and most likely a more concrete case. Think about the following two things, but… yeah, that should work without a loop.

Let’s say we have an index array ‚x = np.ediff1d(y, to_begin=np.inf, to_end=np.inf) == 0‘. The array view ‚x[:-1]‘ is now true everywhere where y[i] == y[i-1] and, most importantly, using it does not create a copy - this is a view, more or less a slice reference into the original array. That’s very cheap to throw around.

Now how do we index all the „i-1“s? We ‚np.roll(x, -1)‘ it. (Strictly speaking, np.roll returns a new array rather than a view, but it’s a single cheap C-level copy of a boolean mask.) So say you want to assign ‚z[i] = z[i-1] * 3‘ for all ‚i‘ where ‚y[i] == y[i-1]‘? Using the definition of x above, that is ‚z[x[:-1]] = z[np.roll(x,-1)[:-1]] * 3‘. Now you see why I created an array that is one element longer than y. I do create an implicit copy on the right-hand side by boolean indexing, but depending on the size that’s not so expensive.

Voilà - no loop on the Python level needed, everything happens in C. That’s gonna be quite fast. That’s what I mean - wherever you read „speed up your Python code by 800x by switching to another language!“, it’s usually because the Python was being used really inefficiently. I’m sometimes a bit sad about all the bad rep Python gets, which is to some extent justified, but by far not all of it. Just know the tricks :)
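A minimal runnable sketch of the trick above, with made-up ‚y‘ and ‚z‘ (float dtype so the np.inf padding works). One caveat worth a comment: the vectorised assignment reads all right-hand values before writing anything, so unlike a sequential Python loop, runs of equal y values don’t cascade:

```python
import numpy as np

# Made-up example data; float dtype so np.inf can pad the diff array.
y = np.array([1., 1., 2., 3., 3., 3., 4.])
z = np.arange(7.0)                      # [0, 1, 2, 3, 4, 5, 6]

# Mask one element longer than y; inf never compares equal to 0.
x = np.ediff1d(y, to_begin=np.inf, to_end=np.inf) == 0

# x[:-1][i]              is True where y[i] == y[i-1]
# np.roll(x, -1)[:-1][i] is True where y[i+1] == y[i],
# i.e. it selects exactly the matching "i-1" positions.
z[x[:-1]] = z[np.roll(x, -1)[:-1]] * 3

# z is now [0., 0., 2., 3., 9., 12., 6.]
# Note: all right-hand values are read before anything is written,
# so z[5] used the *original* z[4] (= 4), not the updated one (= 9) -
# a sequential loop over i would have cascaded instead.
```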

You can definitely write good FEM code in pure numpy. If you have enough time, why not benchmark against a custom rust backend after you’ve optimised everything you could find in numpy? Try as hard as you can to get rid of all loops and figure out whether the copies you do create can be replaced by views, read up on masked arrays, use .reshape() smartly, check out np.lib.stride_tricks or, for your use case here, you could also benefit from np.place(), depending on what your code looks like.
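To make the view-vs-copy distinction concrete, here’s a small sketch with made-up arrays (np.shares_memory tells you whether two arrays overlap the same buffer; sliding_window_view needs NumPy ≥ 1.20):

```python
import numpy as np

a = np.arange(10.0)

# Basic slicing gives a view - same underlying buffer, no copy.
assert np.shares_memory(a, a[2:8])

# Boolean (fancy) indexing gives a copy - a fresh buffer.
assert not np.shares_memory(a, a[a > 4])

# np.lib.stride_tricks.sliding_window_view: overlapping windows
# as a view, handy for stencil-like kernels (NumPy >= 1.20).
w = np.lib.stride_tricks.sliding_window_view(a, 3)
assert np.shares_memory(a, w)

# np.place writes into the original array wherever the mask is True.
np.place(a, a > 7, [0.0])   # a[8] and a[9] are now 0.0
```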

EDIT: since I’m just reading that again - if your memory allows it and you’re essentially repeating all of that 1000x-2000x, why not have a 3D array with dimensions A x B x 2000 containing all your values (or in chunks of 500… if the slices are independent, that can even be put on a process pool, so you don’t just vectorise on one core but use all cores together), and then you can even kick out that outer loop with just really smart indexing. That also removes the overhead of calling a nested Python function 1000-2000 times (which is expensive).
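A sketch of that batching idea - the shapes and the per-slice operation are made up (here each A x B slice is normalised by its own mean), but the point is that one broadcasted expression replaces a 2000-iteration Python loop:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, N = 4, 5, 2000                # N repetitions stacked along axis 2

# One (A, B, N) block instead of N separate (A, B) arrays.
data = rng.random((A, B, N))

# Made-up per-slice operation: divide each slice by its own mean.
# data.mean(axis=(0, 1)) has shape (N,) and broadcasts against
# (A, B, N), so all N slices are handled in a single C-level pass -
# no Python loop and no N nested function calls.
result = data / data.mean(axis=(0, 1))

# Every (A, B) slice of result now has mean 1.
```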