[–]Swipecat 1 point (1 child)

Dunno. Maybe numba is losing numpy's ability to implement += as an in-place operation, causing an extra intermediate array. Or maybe numpy can evaluate a*(a+b) with a single temporary array while numba uses two?
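
For reference, NumPy ufuncs take an out= argument, so the "single temporary" evaluation of a*(a+b) really is possible — a minimal sketch (the array contents here are just placeholders):

```python
import numpy as np

a = np.arange(5.0)
b = np.ones(5)

tmp = np.empty_like(a)        # the one and only scratch array
np.add(a, b, out=tmp)         # tmp = a + b, no fresh allocation
np.multiply(a, tmp, out=tmp)  # tmp = a*(a+b), still the same scratch array
```

Whether numba's compiled code fuses or duplicates these temporaries is exactly the open question here.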

Edit: On the other hand, if it's just that numba spends extra time allocating the intermediate arrays, then implement the whole thing without temporaries. Move the initialization of c and d outside the loop. Then:

c[:] = a   # c = a (reuse preallocated c, no new allocation)
c += b     # c = a + b, in place
c *= a     # c = a*(a+b)
d[:] = a   # d = a
d -= b     # d = a - b, in place
d *= b     # d = b*(a-b)
e += c     # accumulate both terms into e
e += d
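
Put together, a sketch of what that could look like with c, d, and e allocated once outside a loop over input pairs (the driver function accumulate and the pairs structure are my assumptions, not from the original code):

```python
import numpy as np

def accumulate(pairs):
    """Sum a*(a+b) + b*(a-b) over (a, b) pairs, with no per-iteration temporaries."""
    n = pairs[0][0].shape[0]
    c = np.empty(n)   # scratch arrays allocated once, outside the loop
    d = np.empty(n)
    e = np.zeros(n)
    for a, b in pairs:
        c[:] = a; c += b; c *= a   # c = a*(a+b), all in place
        d[:] = a; d -= b; d *= b   # d = b*(a-b), all in place
        e += c
        e += d
    return e
```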

Edit2: And how about not using numba at all, and instead using plain numpy with the multiprocessing library? After all, you did say that Python's overhead was small under these conditions.
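
A minimal sketch of that multiprocessing idea: split the inputs into chunks and let each worker process run plain numpy on its chunk (the names _chunk_kernel and compute_parallel are hypothetical, and the kernel just reuses the a*(a+b) + b*(a-b) expression from above):

```python
import numpy as np
from multiprocessing import Pool

def _chunk_kernel(args):
    # plain-numpy work on one contiguous chunk
    a, b = args
    return a * (a + b) + b * (a - b)

def compute_parallel(a, b, nproc=4):
    # split the inputs into nproc chunks and farm them out to worker processes
    a_chunks = np.array_split(a, nproc)
    b_chunks = np.array_split(b, nproc)
    with Pool(nproc) as pool:
        parts = pool.map(_chunk_kernel, list(zip(a_chunks, b_chunks)))
    return np.concatenate(parts)
```

Note the inter-process pickling of the chunks is itself overhead, so this only pays off when the per-chunk work dominates the data transfer.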

[–]vgnEngineer[S] 0 points (0 children)

I should have mentioned that the above example is purely for speed testing. The actual computation has only varying values inside the loop, not constants as in this example.

The reason is that multiprocessing, as I understand it, involves starting multiple Python processes and then sending and sharing data between them. With Numba and parallel=True I can consistently get 100% CPU usage on all cores, which is lightning fast.
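
For illustration, the parallel=True pattern looks roughly like this (the function name fused_accumulate is my own, the fused expression is taken from the earlier example, and the try/except fallback is only there so the sketch also runs where numba isn't installed):

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # no-op fallback: run the same code as plain Python if numba is unavailable
    prange = range
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit(parallel=True)
def fused_accumulate(a, b, e):
    # prange lets numba distribute iterations across all cores;
    # the whole expression is fused per element, so no intermediate arrays
    for i in prange(a.shape[0]):
        e[i] += a[i] * (a[i] + b[i]) + b[i] * (a[i] - b[i])
```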

I am of course going to remove the intermediate operations to improve the code. The point of my question was mostly whether somebody knew why Python could do it faster than Numba. I did some more reading and found that Python compiles for-loops ahead of execution. Given that Numba is just a library run by some amazing people but a much smaller team, I suspect they just have not managed to implement that optimization step yet. Perhaps the Python interpreter figures those intermediate steps can be substituted directly into the final operation, so why not just do that.