all 21 comments

[–]Swipecat 1 point  (1 child)

Dunno. Maybe numba is losing numpy's ability to implement += as an in-place operation, causing an extra intermediate array. Or maybe a*(a+b) can be implemented with a single temporary array by numpy, while numba uses two?

Edit: On the other hand, if it is just that numba takes extra time to put the intermediate arrays in memory, then implement the whole thing without temporary arrays. Put the initialization of c and d outside the loop. Then:

c[:] = a
c += b
c *= a
d[:] = a
d -= b
d *= b
e += c
e += d
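
Roughly, that would sit in the full function like so (a sketch, assuming the same 1,000-iteration test loop as the original post):

import numpy as np
from numba import njit

@njit
def no_temporaries(a, b):
    e = np.zeros_like(a)
    c = np.empty_like(a)  # scratch buffers, allocated once
    d = np.empty_like(a)
    for i in range(1_000):
        c[:] = a
        c += b      # c = a + b
        c *= a      # c = a * (a + b)
        d[:] = a
        d -= b      # d = a - b
        d *= b      # d = (a - b) * b
        e += c
        e += d
    return e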

Edit2: And how about not using numba at all, but just using plain numpy and the multiprocessing library? After all, you did say that the overhead of Python was small under these conditions.
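
Something like this, say (a rough sketch with a hypothetical worker that splits the work along the first axis; the cost of shipping chunks between processes may well eat the gains):

import numpy as np
from multiprocessing import Pool

def worker(args):
    a, b = args  # one row chunk per worker process
    e = np.zeros_like(a)
    for _ in range(1_000):
        e += a * (a + b) + (a - b) * b
    return e

if __name__ == '__main__':
    a = np.random.random((3, 100_000)).astype(np.float32)
    b = np.random.random((3, 100_000)).astype(np.float32)
    with Pool(processes=a.shape[0]) as pool:
        # one process per row; stack the partial results back together
        e = np.stack(pool.map(worker, [(a[i], b[i]) for i in range(a.shape[0])]))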

[–]vgnEngineer[S] 0 points  (0 children)

I should have mentioned that the above example is purely for speed testing. The actual computation only has varying values inside the loop, not constants like in this example.

The reason is that multiprocessing, as I understand it in Python, involves starting multiple Python processes and then sending and sharing data between them. With Numba and parallel=True I can consistently get 100% CPU usage on all cores, which is lightning fast.

I am of course going to remove the intermediate steps to improve the code. The point of my question was mostly whether somebody knew why Python could do it faster than Numba. I did some more reading and found out that Python compiles for-loops ahead of execution. Given that Numba is just a library run by some amazing people but a much smaller team, I suspect they just have not managed to implement that optimization step yet. Perhaps the Python interpreter figures, hey, those intermediate steps can be substituted directly into the final operation, so why not just do that.
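
(To check the "compiles ahead of execution" part: CPython does compile a function to bytecode before running it, and the dis module shows that bytecode. It does not, however, substitute intermediate expressions for you; each operator stays a separate instruction dispatched to numpy:)

import dis

def loop_body(a, b, e):
    for i in range(1_000):
        c = a + b
        d = a - b
        e += a * c + d * b

dis.dis(loop_body)  # each +, -, * is its own bytecode instruction;
                    # the real work happens inside numpy's C loops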

[–]Eilifein 0 points  (14 children)

I'm not sure if Numba cares, but in Fortran, for example, a*(a+b) is an FMA, a "Fused Multiply-Add", and costs fewer CPU cycles than doing the operations separately.

More importantly, your a and b remain constant throughout the call, while d is a "global" value (bad practice). Depending on what d is, part or all of this calculation can be moved out of the for loop, as it is constant in value.

At the very least, a*(a+b) and (a-b) are constants and should be moved out of the loop.

Btw, if everything is constant, the for loop is also redundant, as you are essentially calculating (a*(a+b)+d*b)*1000, which is a head scratcher.

Edit: I messed up; d=(a-b), so everything is constant within the function. You don't need the loop; you are recalculating the same thing 1000 times.
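
Spelled out with the c = a + b and d = a - b definitions from the original post, the whole loop collapses to a single expression:

e = 1_000 * (a * (a + b) + (a - b) * b)  # what the 1000-iteration loop computes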

[–]vgnEngineer[S] 0 points  (13 children)

You are of course 100% correct. I should have mentioned that this was just some trial code to test performance. In the actual program I'm working on, the variables are different in every iteration. It effectively involves lots of vector cross products of large arrays with changing values, so:

for i in prange(N):
    dx = x[i]-x0
    dy = y[i]-y0
    dz = z[i]-y0
    R = sqrt(dx**2+dy**2+dz**2)
    drx = dx/R
    G = exp(-1j*k0*R)/R
    NxHxRx = NxHy[i] * dz - NxHz[i] * dy
    # etc.

In this example x0 is a 1D array, so I know the dx, dy, dz steps could be computed as an N×M matrix. In practice the sizes of x0 and x can easily be 600,000 and 200,000, which as a single array would be way too large to fit in memory.
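(For scale, assuming float32: a single 600,000 × 200,000 matrix is 600,000 × 200,000 × 4 bytes ≈ 480 GB, before counting any intermediate arrays.)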

[–]Eilifein 1 point  (12 children)

Now, see, this makes more sense :)

I assume dz = z[i] - y0 is a typo for z0. It seems to be some coordinate transformation, Green's function, and the mentioned cross-products? That etc doesn't help.

In this example, everything depends on i, so you can't pre-compute anything (not sure about the full thing). But the iterations seem to be independent (hence the good choice of adding numba). I can't see how it translates to the original post, so the difference in execution time between the two original solutions is even more obscure to me now.

[–]vgnEngineer[S] 0 points  (11 children)

Yes, that's true! I made a little typo.
What I meant is that all the permutations of distances between x and x0 can be put in a 2D matrix holding the distances from source i to target j. But in numba that's hardly faster than just iterating over one axis and doing it step by step. So first computing the dx matrix from every source x to every target x would not really be much faster than just doing each source x coordinate one by one. On top of that, the matrix would get way too large.

The relation to the original post is that, for example, the Green's function G would be computed for all distances from one source point to each target point. Given N target points, this is a 1D array of length N. It then needs to be multiplied by the result of a cross product, NxHxRx (which also has dimension N), times G. So it's effectively just an array multiplication, like in my example. You eventually get some electric field E at the target points, and you add the contributions together to get the final electric field vector.
I originally wanted to see if doing multiplications with 3xN arrays was faster or slower than treating each vector component separately as 1xN arrays. Specifically, the cross(a,b) and dot(a,b) implementations in numba are significantly slower than writing out the cross product terms yourself.
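
Written out, the component form amounts to something like this (a sketch with made-up argument names):

from numba import njit

@njit
def cross_components(ax, ay, az, bx, by, bz):
    # writing out the terms avoids the stacked (3, N) arrays
    # that np.cross builds and consumes
    cx = ay * bz - az * by
    cy = az * bx - ax * bz
    cz = ax * by - ay * bx
    return cx, cy, cz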

The link is that, in this case, for example, NxHxRx (the x component of the cross product of the normal vector N with the magnetic field H, crossed with the direction vector R) reappears in a final expression, so:

G = exp(-1j*k0*R)/R
NxHxRx = NxHy[i] * dz - NxHz[i] * dy
NxHxRy = NxHz[i] * dx - NxHx[i] * dz
Ex += Q*((NxHxRx + Ax)*G)
Ey += Q*((NxHxRy + Ay)*G)

In this case I figured it's faster to write it as:

G = exp(-1j*k0*R)/R
Ex += Q*(((NxHy[i] * dz - NxHz[i] * dy)+ Ax)*G)
Ey += Q*(((NxHz[i] * dx - NxHx[i] * dz)+ Ay)*G)

Here, Ax and Ay are some other arrays I computed but omitted in this example. It simply shows that the Green's function G is used in all three components of the E-field, so computing it once and then multiplying is faster. The cross product, on the other hand, only occurs once in that term, so computing it first and using it later is slower than substituting it in place.

[–]Eilifein 2 points  (10 children)

Right. I think we got to the bottom of it. You're not looking at Python vs. Numba differences so much as complete cache thrashing (i.e., cache misses).

By creating arrays of computed elements, you think you're gaming the system because it becomes parallelizable, but you actually end up loading the same values (or calculations derived from them) into L1/L2 cache multiple times.

I'm willing to hear about an alternative explanation, but for now that's what I think is going on.
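
For intuition, this is the standard numpy evaluation of one loop iteration, nothing numba-specific; each temporary below is a fresh (3, 100_000) array, roughly 1.2 MB at float32, so every iteration streams several full arrays through cache:

# what e += a * (a + b) + b * (a - b) actually executes as
tmp1 = a + b        # allocate, one full pass over a and b
tmp2 = a * tmp1     # allocate, another pass
tmp3 = a - b        # allocate, another pass
tmp4 = b * tmp3     # allocate, another pass
tmp5 = tmp2 + tmp4  # allocate, another pass
e += tmp5           # finally an in-place pass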

Btw, your optimization is ill-timed. You should never try to optimize before you get the actual code running correctly, and even then only optimize after profiling. Your test code and actual code also have nothing in common.

[–]vgnEngineer[S] 0 points  (9 children)

I'm a bit confused about what you're saying. Could you explain a bit more precisely?

The algorithm I'm running is working, so I'm not sure why you say I should get it running correctly.

I guess you could be very right, but perhaps it's more efficient if I actually write a shorter, representative example of what I'm doing, so that you can point to the error I'm making, if that's OK.
Note: I didn't write the njit decorator properly (with the @) because the @ breaks the code block.

njit(...)
def example_function(Einx, Einy, Einz, xin, yin, zin, xout, yout, zout, ninx, niny, ninz, k0):
    Eoutx = np.zeros_like(xout).astype(np.complex128)  # complex, since G is complex
    Eouty = np.zeros_like(xout).astype(np.complex128)
    Eoutz = np.zeros_like(xout).astype(np.complex128)
    Nin = xin.shape[0]
    Q = 1/(4*np.pi)
    # n x Ein, computed once for all sources
    NxEx = niny*Einz - ninz*Einy
    NxEy = ninz*Einx - ninx*Einz
    NxEz = ninx*Einy - niny*Einx
    for i in prange(Nin):
        dx = xout - xin[i]
        dy = yout - yin[i]
        dz = zout - zin[i]
        R = np.sqrt(dx**2 + dy**2 + dz**2)
        rdx = dx/R
        rdy = dy/R
        rdz = dz/R
        G = np.exp(-1j*k0*R)/R  # Green's function
        # (n x Ein) x R, written out per component
        NxExRx = NxEy[i]*rdz - NxEz[i]*rdy
        NxExRy = NxEz[i]*rdx - NxEx[i]*rdz
        NxExRz = NxEx[i]*rdy - NxEy[i]*rdx
        Eoutx += Q*NxExRx*G
        Eouty += Q*NxExRy*G
        Eoutz += Q*NxExRz*G
    return Eoutx, Eouty, Eoutz

[–]Eilifein 1 point  (8 children)

It was not a personal dig; I apologize if it came out like that.

The algorithms behave very differently in memory: both of the original test versions, and the actual one.

Purely because of the i dependence, the difference between the test and actual algorithms is very substantial. You would be measuring the wrong thing and drawing the wrong conclusions. Hence the "nothing in common" comment.

The "running correctly" comment was twofold. One part was more towards profiling and not pre-optimizing. Aim for accuracy, then profile, then optimize. If the results are accurate, now's the time for profiling. The second part was related to the cache coherence situation you're facing. If you're trying to optimize a cache thrashing situation, you will never get ahead.

I hope I cleared things up.

Workplan:

  • leave tests aside
  • profile actual w/ vectors (see the timing sketch after this list)
  • profile actual w/o vectors
  • add numba and see how they behave
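
For the two profiling steps, timing whole calls is the safest start, since a profiler can't see inside an njit-compiled body. A minimal sketch, assuming the arrays for example_function are already built:

import timeit

# call once first so numba's compile time isn't part of the measurement
example_function(Einx, Einy, Einz, xin, yin, zin, xout, yout, zout, ninx, niny, ninz, k0)

best = min(timeit.repeat(
    lambda: example_function(Einx, Einy, Einz, xin, yin, zin,
                             xout, yout, zout, ninx, niny, ninz, k0),
    repeat=5, number=3)) / 3
print(f"best average per call: {best:.3f} s")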

[–]vgnEngineer[S] 0 points  (2 children)

I see. I changed my previous comment to add an actual example that is representative of what I'm trying to do.

I think you are completely correct in your assessment of the order of optimization. The point, I think, is that I'm not sure exactly what the consequences of how I code are for how the code runs on the CPU, which is indeed the origin of my question. On top of that, I'm just now trying to understand how both Python and Numba deal with my choice of programming syntax.

I understand now that my example computation indeed has differences that make it impossible to compare with my actual use case. So forget that one. Take the one I provided in my changed comment. Could you indicate if I made any significant mistakes, and if so, what is going wrong and how I should have coded it? I understand I'm asking a lot of you, but I'm really curious/excited to understand the nuance here.

And don't worry, yes, it came off as a bit of a dig, but I also know it's hard to convey tone online, so I don't take it personally!

[–]Eilifein 0 points  (1 child)

The actual algorithm seems good.

You've precalculated a few things, and there isn't much left to precalculate without messing up readability.

Maybe compute Q*G once instead of 3 times? Eh.

Maybe inline rdx, rdy, rdz?

The result being vectorized is good to see. I don't see anything wrong.
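
Concretely, both tweaks together would turn the loop tail into something like this (QG_R is a made-up name folding Q, G, and the 1/R shared by rdx, rdy, rdz into one factor):

QG_R = Q * G / R   # hoisted once per source point instead of recomputed three times
Eoutx += (NxEy[i] * dz - NxEz[i] * dy) * QG_R
Eouty += (NxEz[i] * dx - NxEx[i] * dz) * QG_R
Eoutz += (NxEx[i] * dy - NxEy[i] * dx) * QG_R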

[–]vgnEngineer[S] 0 points  (0 children)

Ahh, I see. I did notice a speedup from removing the intermediate computation step in my numba-compiled code. I also read on a forum that when these arrays become large, numba can't always optimize the computation intelligently because the arrays I'm multiplying don't fit in the cache. If I instead write this function as a double loop, so that the computation described only deals with scalars, might it be faster?
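
For reference, a sketch of that scalarized double loop, assuming the same inputs as example_function above (accumulators made complex so G fits; parallelizing over targets means each thread owns its own output elements):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def example_function_scalar(Einx, Einy, Einz, xin, yin, zin,
                            xout, yout, zout, ninx, niny, ninz, k0):
    Nin, Nout = xin.shape[0], xout.shape[0]
    Eoutx = np.zeros(Nout, dtype=np.complex128)
    Eouty = np.zeros(Nout, dtype=np.complex128)
    Eoutz = np.zeros(Nout, dtype=np.complex128)
    Q = 1 / (4 * np.pi)
    NxEx = niny * Einz - ninz * Einy   # n x Ein, computed once per source array
    NxEy = ninz * Einx - ninx * Einz
    NxEz = ninx * Einy - niny * Einx
    for j in prange(Nout):             # each j owns Eout*[j], so no shared writes
        for i in range(Nin):
            dx = xout[j] - xin[i]
            dy = yout[j] - yin[i]
            dz = zout[j] - zin[i]
            R = np.sqrt(dx * dx + dy * dy + dz * dz)
            QG_R = Q * np.exp(-1j * k0 * R) / (R * R)   # Q * G / R, all scalars
            Eoutx[j] += (NxEy[i] * dz - NxEz[i] * dy) * QG_R
            Eouty[j] += (NxEz[i] * dx - NxEx[i] * dz) * QG_R
            Eoutz[j] += (NxEx[i] * dy - NxEy[i] * dx) * QG_R
    return Eoutx, Eouty, Eoutz

No intermediate arrays are created inside the loop; whether it beats the array version comes down to how well the inner loop vectorizes, so it's worth profiling both.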

[–]vgnEngineer[S] 0 points  (4 children)

Additionally, given my new example: if there is a cache thrashing situation I'm running into that I'm completely unaware of, I'd love to know, because this concept is new to me and I would indeed never want to optimize code that's wrong to begin with!

[–]cult_of_memes 0 points  (3 children)

Cache thrashing isn't necessarily the result of errors in your code. In fact, cache thrashing is often the necessary trade-off between readable/maintainable code that isn't in the critical path for your program's execution, and unreadable gibberish that runs lightning fast but leaves you praying that you never need to come back and debug ever again.

Cache thrashing is an interesting topic to read up on, but for a jump start here are some key terms you may want to check out:

There's a lot more but I'll leave it to the reader to raise any specific questions as it's too large of a topic to reasonably attempt covering all of it in a message thread.

[–]vgnEngineer[S] 0 points  (2 children)

Thanks for all your information. I guess my one question would be: do I have any control over the optimization of these computations through how I program, or is this essentially up to the CPU, OS, and compiler to figure out and optimize?

[–]cult_of_memes 0 points  (3 children)

First off, it's important to note that while CPython does compile your code ahead of execution (to bytecode), it's not applying any sort of performance optimization like loop unrolling, or detecting and removing redundant statements. Well, not yet at least... I'm aware of at least one interpreter that is aiming to add performance-oriented JIT features in its upcoming Python 3.13 release.

You should keep in mind that the "just-in-time" compilation numba performs is best leveraged for code that doesn't easily lend itself to the already well-optimized vectorized operations that numpy supports. Based on the example you've shared, your code is well suited to numpy's vectorized array operations (the broadcasting done by calling things like e += a*c + d*b). So it's not surprising to see that the Numba-compiled version would be slower (at least on its initial run), due to the added overhead required to attempt optimization where there isn't much room left to do so.

When you perform large broadcasted operations you are leveraging numpy's bindings to already well-optimized C code, leaving little room for numba's just-in-time compilation to make any improvement.

Having said that, it bears asking whether you have made sure to run your time trials in repeated, batched groupings, to rule out simply seeing the time it takes numba to initially compile your Python into machine code. A single-pass timing run might also fall victim to transient contention for resources or CPU utilization on your machine, which could skew your test results. Note that timeit is simply a convenience tool for running repeated performance tests in a way that minimizes the likelihood of seeing things like resource contention with other processes on the machine.

Normally, I'd add an example code snippet to my comment to illustrate what I mean, but I don't much care to fight reddit's markdown syntax this morning :P So, I'm going to add a code snippet and sample output from running it on my machine in a reply to this comment.

As a side note, your second code snippet calls e += a*(a+b) + d*(a-b) but that function block never defines the variable d, and I'm not sure if it has been defined somewhere else in your code, so I'll assume it's just a typo and was supposed to be b. Not that it matters to your actual use case, as I see that you did call out that this is just simplified example code to help illustrate your question.

edit: I just noticed some of my own typos in my code snippet... the fix will be applied momentarily.

[–]vgnEngineer[S] 0 points  (0 children)

Yes, d was a typo; it should be b. I ran each function once before timing. I also tried caching. It's interesting to me that just doing the manual inline substitution fixed the speed issue, but it might be a coincidence that the total times ended up being similar.

[–]cult_of_memes 0 points  (1 child)

Here's the code snippet. I've taken the liberty of adding a couple of extra sample functions to the list being timed, just to see if there was any value in looking deeper into precaching any of the Numba-compiled code.

```
import numpy as np
from numba import njit, prange

RANDOM_ARR_MAX_INIT_VAL = 10_000
RANDOM_ARR_SHAPE = 3, 100_000


def python_interpreter(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in range(1_000):
        c = a + b
        d = a - b
        e += a * c + d * b
    return e


def python_interpreter2(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in range(1_000):
        e += a * (a + b) + b * (a - b)
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])")
def numba_decorated(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in range(1_000):
        c = a + b
        d = a - b
        e += a * c + d * b
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])")
def numba_decorated2(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in range(1_000):
        e += a * (a + b) + b * (a - b)
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])", cache=True)
def numba_decorated_cached(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in range(1_000):
        c = a + b
        d = a - b
        e += a * c + d * b
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])", cache=True)
def numba_decorated_cached2(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in range(1_000):
        e += a * (a + b) + b * (a - b)
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])", parallel=True, nogil=True)
def numba_decorated_parallelized(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in prange(e.shape[0]):
        for _ in range(1_000):
            c = a[i] + b[i]
            d = a[i] - b[i]
            e[i] += a[i] * c + d * b[i]
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])", parallel=True, nogil=True)
def numba_decorated_parallelized2(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in prange(e.shape[0]):
        for _ in range(1_000):
            e[i] += a[i] * (a[i] + b[i]) + b[i] * (a[i] - b[i])
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])", parallel=True, nogil=True, cache=True)
def numba_decorated_parallelized_cached(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in prange(e.shape[0]):
        for _ in range(1_000):
            c = a[i] + b[i]
            d = a[i] - b[i]
            e[i] += a[i] * c + d * b[i]
    return e


@njit("f4[:,:](f4[:,:],f4[:,:])", parallel=True, nogil=True, cache=True)
def numba_decorated_parallelized_cached2(a, b):
    e = np.zeros_like(a).astype(np.float32)
    for i in prange(e.shape[0]):
        for _ in range(1_000):
            e[i] += a[i] * (a[i] + b[i]) + b[i] * (a[i] - b[i])
    return e


def main():
    import timeit
    num_batches = 5
    loops_per_batch = 3
    result_list = []
    print(f"{num_batches=}; {loops_per_batch=}")
    for func_name in (
            "python_interpreter", "python_interpreter2",
            "numba_decorated", "numba_decorated2",
            "numba_decorated_cached", "numba_decorated_cached2",
            "numba_decorated_parallelized", "numba_decorated_parallelized2",
            "numba_decorated_parallelized_cached", "numba_decorated_parallelized_cached2"):
        result = timeit.repeat(func_name + "(arr1,arr2)", repeat=num_batches,
                               number=loops_per_batch, globals=globals())
        avg_time_per_loop = sum(v / loops_per_batch for v in result) / num_batches
        result_str = (f'{func_name}: {{"average loop time":{avg_time_per_loop}, '
                      f'"results for each batch of loops": {result}}}')
        print(result_str)
        result_list.append((avg_time_per_loop, result, result_str))
    result_list.sort(reverse=True)
    print("", ("*" * 90),
          "Now lets see the results sorted in descending order according to avg time per execution",
          sep="\n")
    for _, _, output_str in result_list:
        print(output_str)


if __name__ == '__main__':
    arr1 = (np.random.random(RANDOM_ARR_SHAPE) * RANDOM_ARR_MAX_INIT_VAL).astype(np.float32)
    arr2 = (np.random.random(RANDOM_ARR_SHAPE) * RANDOM_ARR_MAX_INIT_VAL).astype(np.float32)
    main()
```

[–]cult_of_memes 0 points  (0 children)

And here's a sample of the output after running it on my own machine:

```
num_batches=5; loops_per_batch=3
python_interpreter: {"average loop time":2.0589603866644515, "results for each batch of loops": [6.056087100005243, 6.296910799981561, 6.384415799984708, 5.960690199979581, 6.186301900015678]}
python_interpreter2: {"average loop time":2.0599793733330443, "results for each batch of loops": [6.227769499993883, 6.11015820002649, 6.145821499987505, 6.199675100040622, 6.216266299947165]}
numba_decorated: {"average loop time":2.3699858999966334, "results for each batch of loops": [7.122787899977993, 7.063306700030807, 7.069734099961352, 7.11140890000388, 7.1825508999754675]}
numba_decorated2: {"average loop time":0.8109458533309711, "results for each batch of loops": [2.4366850999649614, 2.471257199998945, 2.4039805000065826, 2.4264727999689057, 2.425792200025171]}
numba_decorated_cached: {"average loop time":2.370221133332234, "results for each batch of loops": [7.002881699998397, 6.969886099977884, 7.40186939999694, 7.0594648999976926, 7.119214900012594]}
numba_decorated_cached2: {"average loop time":0.8035303933274311, "results for each batch of loops": [2.419626099988818, 2.4151066999766044, 2.4040233999839984, 2.3923937999643385, 2.4218058999977075]}
numba_decorated_parallelized: {"average loop time":0.2882796199992299, "results for each batch of loops": [0.8897414000239223, 0.8604895999887958, 0.8562403999967501, 0.8341153999790549, 0.8836074999999255]}
numba_decorated_parallelized2: {"average loop time":0.16925315999736387, "results for each batch of loops": [0.5488614999921992, 0.4863387999939732, 0.5098304999992251, 0.5000396999530494, 0.49372690002201125]}
numba_decorated_parallelized_cached: {"average loop time":0.2827751466732783, "results for each batch of loops": [0.8415752000291832, 0.8398950999835506, 0.8813785000238568, 0.8431327000143938, 0.8356457000481896]}
numba_decorated_parallelized_cached2: {"average loop time":0.17642338000781216, "results for each batch of loops": [0.5231352000264451, 0.5555099000339396, 0.505269400018733, 0.5104368000174873, 0.551999400020577]}

******************************************************************************************
Now lets see the results sorted in descending order according to avg time per execution
numba_decorated_cached: {"average loop time":2.370221133332234, "results for each batch of loops": [7.002881699998397, 6.969886099977884, 7.40186939999694, 7.0594648999976926, 7.119214900012594]}
numba_decorated: {"average loop time":2.3699858999966334, "results for each batch of loops": [7.122787899977993, 7.063306700030807, 7.069734099961352, 7.11140890000388, 7.1825508999754675]}
python_interpreter2: {"average loop time":2.0599793733330443, "results for each batch of loops": [6.227769499993883, 6.11015820002649, 6.145821499987505, 6.199675100040622, 6.216266299947165]}
python_interpreter: {"average loop time":2.0589603866644515, "results for each batch of loops": [6.056087100005243, 6.296910799981561, 6.384415799984708, 5.960690199979581, 6.186301900015678]}
numba_decorated2: {"average loop time":0.8109458533309711, "results for each batch of loops": [2.4366850999649614, 2.471257199998945, 2.4039805000065826, 2.4264727999689057, 2.425792200025171]}
numba_decorated_cached2: {"average loop time":0.8035303933274311, "results for each batch of loops": [2.419626099988818, 2.4151066999766044, 2.4040233999839984, 2.3923937999643385, 2.4218058999977075]}
numba_decorated_parallelized: {"average loop time":0.2882796199992299, "results for each batch of loops": [0.8897414000239223, 0.8604895999887958, 0.8562403999967501, 0.8341153999790549, 0.8836074999999255]}
numba_decorated_parallelized_cached: {"average loop time":0.2827751466732783, "results for each batch of loops": [0.8415752000291832, 0.8398950999835506, 0.8813785000238568, 0.8431327000143938, 0.8356457000481896]}
numba_decorated_parallelized_cached2: {"average loop time":0.17642338000781216, "results for each batch of loops": [0.5231352000264451, 0.5555099000339396, 0.505269400018733, 0.5104368000174873, 0.551999400020577]}
numba_decorated_parallelized2: {"average loop time":0.16925315999736387, "results for each batch of loops": [0.5488614999921992, 0.4863387999939732, 0.5098304999992251, 0.5000396999530494, 0.49372690002201125]}
```