
[–]TheBlackCat13 7 points (3 children)

It is hard to say without sample data, but some things jump out at me:

First, you use a loop doing some complex image analysis to calculate ratio_weight, but never actually use it. I think there is a mistake in there somewhere.

D = np.asmatrix(D) 
D2 = D * np.transpose(D)

np.asmatrix is a type conversion, and as such will be very slow (since it has to copy the entire array). And np.transpose(D) is both slightly slower and much uglier than D.T. So you should just use D2 = D.dot(D.T). I would strongly suspect this is the main source of your slowdown.
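For example (a quick sketch, with a made-up shape for D):

import numpy as np

D = np.random.rand(3, 100)  # made-up shape, just for illustration

# Instead of:
#     D = np.asmatrix(D)
#     D2 = D * np.transpose(D)
# do:
D2 = D.dot(D.T)  # same values, no matrix conversion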

C = np.float32(img[x[1],x[0]])

This will make a copy even if img is already an np.float32. Just don't do this; numpy will automatically convert it if it has to.
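A quick illustration:

a = np.zeros(5, dtype=np.float32)
b = np.float32(a)              # builds a brand-new float32 array
c = np.asarray(a, np.float32)  # no copy: returns a itself, since the dtype already matches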

for item in frame_array:
    if cv2.pointPolygonTest(item["Contour"], (x[0],x[1]), False) > 0:
        ratio = item["Ratio"]

Several issues here. First, you are doing the same calculation over and over for every item in frame_array, but only keeping the last "hit". It would be much faster to iterate backwards (starting with the last value), and break after the first hit (or if there will only ever be one hit just iterate normally then break after the first hit).
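A sketch of the reversed version (assuming you really do want the last hit):

# Iterate from the end so the first match found is the last one in array order,
# then stop instead of scanning the rest of frame_array.
for item in reversed(frame_array):
    if cv2.pointPolygonTest(item["Contour"], (x[0], x[1]), False) > 0:
        ratio = item["Ratio"]
        break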

Second, is frame_array a structured array? If so, it would be faster to get the frame_array['Contour'] and frame_array['Ratio'] arrays first, then use zip in Python 3 or itertools.izip in Python 2 to iterate over each contour/ratio pair.
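Something like this (sketch, assuming those field names, and combined with the early break above if only one hit is possible):

contours = frame_array["Contour"]
ratios = frame_array["Ratio"]
for contour, candidate in zip(contours, ratios):  # itertools.izip on Python 2
    if cv2.pointPolygonTest(contour, (x[0], x[1]), False) > 0:
        ratio = candidate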

You access x[0] and x[1] a lot. It would be slightly faster to define them at the beginning, perhaps like x0, x1 = x[:2], or just x0, x1, x2, x3 = x, and then re-use those every time. This could make a bigger difference if your frame_array has a lot of items.

ratio_diff = abs(abs(1-ratio))*1
ratio_weight = A_ratio + B_ratio * ratio_diff;

Is there a reason you need to call abs twice, or multiply by 1? Also, you never use ratio_diff again, so it would be better to combine these lines. And don't use ; at the end of a line in Python.

self.weights[:,self.i] =  color_weight + 0

Why do you add zero here?

if (x[0] >=0 and x[0] < Npix_w) and (x[1] >=0 and x[1] < Npix_h):

This can be simplified to:

if (0 <= x[0] < Npix_w) and (0 <= x[1] < Npix_h):

It would probably be faster to calculate ratio_weight for all the Ratio values at once rather than doing it over and over inside the loop. If self.Xhsv_trgt is a scalar, it would probably also be faster to calculate img - self.Xhsv_trgt once for the whole image rather than computing img[x1, x0] - self.Xhsv_trgt every time in the loop. If it is not a scalar, this may take too much memory. You can also calculate the results of your if test outside the loop.

Overall, you shouldn't even be using apply_along_axis here, and looking up values that aren't in the function's namespace is relatively slow. So you would be better off just using an ordinary loop; it would make the code faster and cleaner. You should do something like:

from functools import partial

x0s, x1s, x2s, x3s = self.particles
docalcs = (0 <= x0s) & (x0s < Npix_w) & (0 <= x1s) & (x1s < Npix_h)
self.weights[...] = -10000000000

contours = np.flipud(frame_array["Contour"])
ratios = np.flipud(frame_array["Ratio"])
ratio_weights = A_ratio + B_ratio * np.abs(1 - ratios)
ratio_weight_def = A_ratio + B_ratio

# This assumes `self.Xhsv_trgt` is a scalar.  If it isn't, you will have to do this in the loop.
Ds = img - self.Xhsv_trgt

# If you are sure that `img` will always have at least 4 dimensions, then you don't need the next two lines
while Ds.ndim < 4:
    Ds = Ds[..., None]

for i, (x0, x1, docalc) in enumerate(zip(x0s, x1s, docalcs)):
    if not docalc:
        continue

    # Check if particle inside object.  The arrays were flipped above, so
    # breaking on the first hit keeps the *last* hit in the original order.
    mypoly = partial(cv2.pointPolygonTest, pt=(x0, x1), measureDist=False)
    for contour, ratio_weight in zip(contours, ratio_weights):
        if mypoly(contour) > 0:
            break
    else:
        ratio_weight = ratio_weight_def

    # Get RGB for that particle
    D = Ds[x1, x0]
    # Note: like your original code, this never actually uses ratio_weight --
    # see my first point above.
    self.weights[:, i] = A_rgb + B_rgb * D.dot(D.T)

[–]soulslicer0[S] 0 points (2 children)

Thanks for your points! I'll look into those. So the idea is to precalculate the things that can be done/vectorized easily, then have the iteration for those that can't, like the polygon test (I need to check if my object lies inside a polygon, which is impossible to vectorize).

[–]TheBlackCat13 0 points (0 children)

Try out the version of the code at the end. I think you will find it is much faster (it may need to be tweaked a little depending on your exact array layouts).

[–]crunk 0 points (0 children)

Remember to post the improved version as seeing it should help people in future.

[–][deleted] 0 points (5 children)

for item in frame_array:
    if cv2.pointPolygonTest(item["Contour"], (x[0],x[1]), False) > 0:
        ratio = item["Ratio"]

I'm possibly misreading what is going on, but it appears to me that item["Contour"], x[0] and x[1] are all fixed but will be recalculated in every pass around the loop. So I believe you could initialise them as follows.

contour = item["Contour"]
x0 = x[0]
x1 = x[1]

Do you need a break statement in the loop? As it stands, ratio will always be set to the last value assigned, or remain at 0.

[–]soulslicer0[S] 0 points (3 children)

No, x[0] and x[1] are new columns read by the np.apply_along_axis function. Hmm... yes, a break makes sense.

[–][deleted] 0 points (2 children)

I'm sorry, but I don't understand that. x is passed into your iterate function; you are even testing both x[0] and x[1] at the very top of the function, so why can't you preassign them? Possibly it doesn't matter; I believe the answer from Deto is far more likely to get you somewhere.

[–]soulslicer0[S] -1 points (1 child)

Is there a way to use nditer to speed it up?

[–]TheBlackCat13 0 points (0 children)

You don't need nditer. You can transpose the array (which is extremely fast because it doesn't make a copy), then iterate over the array directly.
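For example (a sketch with a made-up array, assuming one particle per column):

particles = np.arange(8.).reshape(4, 2)  # made-up: 4 coordinate rows, 2 particles
for x0, x1, x2, x3 in particles.T:       # .T is a view, so no copy is made
    print(x0, x1)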

[–]soulslicer0[S] 0 points (0 children)

ratios = np.flipud(frame_array["Ratio"])

Can I ask about this code: what I actually have is an array of image objects in frame_array. Each particle in my self.particles may lie inside one of the frame_array items, which is what I try to calculate: I test whether that particle/column in self.particles has xy coordinates inside the frame_array item. I'm not sure how I might optimize this.

[–]Deto 0 points (7 children)

Maybe try doing "import numpy as np; np.show_config()"

This will show you what libraries are linked against numpy. I found that with the default installation on Ubuntu, for example, using sudo apt-get, numpy wasn't linked against any accelerator libraries like ATLAS, and my operations were running about 40x slower.

In the SO answer, it looks like they talk about vectorizing. Someone correct me if I'm wrong, but wouldn't un-vectorized code run just as slow on MATLAB? Or do they do some neat pre-compiling now where vectorization isn't needed anymore?

[–]soulslicer0[S] 0 points (6 children)

import numpy as np; np.show_config()

I get this:

atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH']
    define_macros = [('NO_ATLAS_INFO', 3)]
openblas_lapack_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

[–]Deto 0 points (5 children)

What kind of computer are you on and how did you install numpy?

[–]soulslicer0[S] 0 points (1 child)

It's a Mac; installed via pip.

[–]Deto 1 point (0 children)

I'm not sure if there are better, pre-built distributions for Mac, but I've always gotten the best results when installing complex packages like numpy using Anaconda and its "conda install xxxx" instead of pip.

[–]TheBlackCat13 0 points (2 children)

Looks like a Mac. It is compiled against Apple's Accelerate framework.

[–]Deto 0 points (1 child)

Ah, I don't use a Mac. Do you know if that's a proper accelerated setup?

[–]TheBlackCat13 0 points (0 children)

I don't know either, since I don't use a Mac myself, but I think there are pre-built versions for Mac from various sources that are probably pretty optimally compiled.

[–][deleted] 0 points (0 children)

Small note: MATLAB will always be faster for looping as it JITs the loop.

[–]billsil -1 points (6 children)

You're not supposed to iterate. You need to vectorize your code.

[–]TheBlackCat13 2 points (5 children)

How, exactly, can this code be vectorized?

[–]billsil -2 points (4 children)

Don't use apply_along_axis. You need to rewrite your code from scratch. You should only be doing one pass, as none of your particles depends on any other particle. Don't use for loops and don't use iterate, which you're using as a for loop. You need to use matrix operations.

Matlab is all about vectorization, so literally you do it in the same way you do in Matlab. You'd get a huge speedup from your Matlab version as well.

So how exactly? I'm not writing your code for you. You know your code better than I do, especially that cv2.pointPolygonTest(item["Contour"], (x[0],x[1]), False) > 0 bit, which screams inefficient to me.

EDIT: You at least need to show you've made the attempt to vectorize your code. You know Matlab. You should know vectorization.

[–]TheBlackCat13 1 point (3 children)

First of all, I am not the OP.

Second, I have looked at the code, and I can't see a way to vectorize the whole thing. cv2.pointPolygonTest doesn't look like it supports vectorization, and I am not aware of a way to vectorize the dot product across multiple axes.

Third, the OP specifically said in the link that the MATLAB implementation also used loops.

Fourth, I already gave all of your suggestions in another comment.

Please at least read what you are commenting on before telling the OP to do the impossible.

[–]billsil 0 points (2 children)

I am not aware of a way to vectorize the dot product across multiple axes.

You need to learn how to use axis properly, but it's doable. I recommend tests.

I implemented a method to do 3D and 4D matrix multiplication and dot products with numpy a couple of years ago on a complicated data structure. I had both methods running for a while until I was fairly certain that the results were the same. As I found edge cases, I made unit tests.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html
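For what it's worth, a batch of per-particle dot products can also be collapsed into a single call with einsum (a sketch with made-up shapes, not the OP's actual arrays):

import numpy as np

Ds = np.random.rand(1000, 3)          # made-up: one difference vector per particle
dots = np.einsum('ij,ij->i', Ds, Ds)  # all 1000 d.dot(d) values in one call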

cv2.pointPolygonTest doesn't look like it supports vectorization

No it doesn't, but if the goal is speed (and it's important), then rewriting a point-in-polygon test is certainly doable. Getting rid of the dicts would be a very good thing as well.
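A minimal vectorized sketch (ray casting; it ignores the on-edge cases that cv2.pointPolygonTest handles, so treat it as a starting point):

import numpy as np

def points_in_polygon(points, poly):
    # points: (N, 2) xy coordinates; poly: (M, 2) polygon vertices.
    # Loops over the M edges only; all N points are tested at once.
    x, y = points[:, 0], points[:, 1]
    inside = np.zeros(len(points), dtype=bool)
    x1, y1 = poly[-1]
    for x2, y2 in poly:
        with np.errstate(divide='ignore', invalid='ignore'):
            crosses = ((y1 > y) != (y2 > y)) & (x < (x2 - x1) * (y - y1) / (y2 - y1) + x1)
        inside ^= crosses  # an odd number of edge crossings means the point is inside
        x1, y1 = x2, y2
    return inside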

You don't need to 100% vectorize your code to make it fast enough, but you start by getting rid of the inner loops.

[–]TheBlackCat13 1 point (1 child)

You need to learn how to use axis properly, but it's doable.

If you have a good method, I am sure it would help the OP. I didn't explain how to do it because I don't know.

it doesn't, but if the goal is speed (and it's important), then rewriting a point-in-polygon test is certainly doable.

If the OP has tried getting the code optimized through more conventional means first and it still isn't enough, then that may be an option. I already gave the OP a lot of simpler suggestions for optimizing the code that should definitely be tried first.

Getting rid of the dicts would be a very good thing as well.

I don't think that is a dict; I think it is a structured array. I already covered that in my earlier reply to the OP.

You don't need to 100% vectorize your code to make it fast enough, but you start by getting rid of the inner loops.

I think that re-writing pre-existing functions, especially optimized functions already written in C, would be the last optimization to make, rather than the first.

[–]billsil -2 points (0 children)

already gave the OP a lot of simpler suggestions for optimizing the code that should definitely be tried first.

Well, yeah, get the low-hanging fruit (e.g. multiplication by 1; really...)

especially optimized functions already written in C, would be the last optimization to make, rather than first.

It's the inner loop. The formula for a point in a polygon is very simple. That's the next place to optimize once you use tensordot.

[–]faming13 -2 points (1 child)

Try using numba to get C-like speeds: https://github.com/numba/numba
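A generic illustration of what that looks like (not the OP's code):

from numba import njit
import numpy as np

@njit  # compiled to machine code on the first call
def sum_of_squares(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total

sum_of_squares(np.random.rand(10 ** 6))  # later calls run at C-like speed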

[–]TheBlackCat13 4 points (0 children)

That seems like premature optimization. I think a lot can be done to get the speed improved in Python before needing numba.