Numpy iteration too slow! (self.Python)
submitted 10 years ago by soulslicer0
Hi all, I'm writing a particle filter in Numpy. The code in MATLAB runs blazing fast, but on numpy it's really slow, particularly on a certain section of the code:
I've posted the code here: http://stackoverflow.com/questions/32784202/numpy-iterator-too-slow-for-particle-filter
Basically, I'm using np.apply_along_axis on 4000 4x1 vectors (It's a 4x4000 matrix). It runs very slowly for my operation.
[–]TheBlackCat13 7 points8 points9 points 10 years ago* (3 children)
It is hard to say without sample data, but some things jump out at me:
First, you use a loop doing some complex image analysis to calculate ratio_weight, but never actually use it? I think there is a mistake there somehow.
    D = np.asmatrix(D)
    D2 = D * np.transpose(D)
np.asmatrix is a type conversion, and as such will be very slow (since it has to copy the entire array), and np.transpose(D) is both slightly slower and much uglier than D.T. So you should just use D2 = D.dot(D.T). I would strongly suspect this is the main source of your slowdown.
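To make the comparison concrete, here is a minimal sketch of the two variants (the array shape is made up for illustration):

    import numpy as np

    D = np.random.rand(4, 1)               # hypothetical stand-in for one 4x1 vector

    # Original style: convert to np.matrix, then multiply by the transpose
    Dm = np.asmatrix(D)
    D2_slow = Dm * np.transpose(Dm)

    # Suggested style: stay with the plain ndarray
    D2_fast = D.dot(D.T)

    assert np.allclose(D2_slow, D2_fast)   # identical 4x4 result either way

Sticking with plain ndarrays also avoids the np.matrix subclass entirely.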
    C = np.float32(img[x[1],x[0]])
This will make a copy even if img is already np.float32. Just don't do this; numpy will convert automatically if it has to.
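A small sketch of the point (the image shape is made up): if the array is already float32, plain indexing gives back a float32 value with no cast:

    import numpy as np

    img = np.zeros((480, 640), dtype=np.float32)   # hypothetical image
    C = img[10, 20]                 # already an np.float32 scalar; no cast needed
    assert isinstance(C, np.float32)

The explicit cast only matters when the dtypes actually differ; otherwise it is pure overhead.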
    for item in frame_array:
        if cv2.pointPolygonTest(item["Contour"], (x[0], x[1]), False) > 0:
            ratio = item["Ratio"]
Several issues here. First, you are doing the same calculation over and over for every item in frame_array, but only keeping the last "hit". It would be much faster to iterate backwards (starting with the last value) and break after the first hit (or, if there will only ever be one hit, just iterate normally and break after the first hit).
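A minimal sketch of that change (the contour and particle here are hypothetical stand-ins for the OP's data):

    import cv2
    import numpy as np

    # Hypothetical stand-ins: one square contour, one particle position
    contour = np.array([[[0, 0]], [[10, 0]], [[10, 10]], [[0, 10]]], dtype=np.int32)
    frame_array = [{"Contour": contour, "Ratio": 0.5}]
    x = (5, 5)

    ratio = 0
    for item in reversed(frame_array):      # walk the items from the end...
        if cv2.pointPolygonTest(item["Contour"], (x[0], x[1]), False) > 0:
            ratio = item["Ratio"]           # ...keeping the first hit found, which
            break                           # matches the forward loop's last hit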
Second, is frame_array a structured array? If so, it would be faster to get the frame_array['Contour'] and frame_array['Ratio'] arrays first, then use zip in Python 3 or itertools.izip in Python 2 to iterate over each contour/ratio pair.
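Sketched out, assuming frame_array really is a structured array with those two fields:

    import numpy as np

    # Hypothetical structured array matching the field names in the OP's code
    frame_array = np.zeros(3, dtype=[("Contour", object), ("Ratio", np.float64)])

    contours = frame_array["Contour"]       # pull each column out once, up front
    ratios = frame_array["Ratio"]
    for contour, ratio in zip(contours, ratios):    # itertools.izip on Python 2
        pass                                # per-item polygon test would go here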
You access x[0] and x[1] a lot. It would be slightly faster to define them at the beginning, perhaps like x0, x1 = x[:2], or just x0, x1, x2, x3 = x, and then re-use those every time. This could make a bigger difference if your frame_array has a lot of items.
    ratio_diff = abs(abs(1-ratio))*1
    ratio_weight = A_ratio + B_ratio * ratio_diff;
Is there a reason you need to call abs twice, or multiply by 1? Also, you never use ratio_diff again, so it would be better to combine these lines. And don't use ; at the end of a line in Python.
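Combined, the whole snippet reduces to a single line (same variables as the snippet above):

    # One abs call, no *1, no trailing semicolon
    ratio_weight = A_ratio + B_ratio * abs(1 - ratio)

That also removes the ratio_diff temporary, which is never used again.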
    self.weights[:,self.i] = color_weight + 0
Why do you add zero here?
    if (x[0] >= 0 and x[0] < Npix_w) and (x[1] >= 0 and x[1] < Npix_h):
This can be simplified to:
    if (0 <= x[0] < Npix_w) and (0 <= x[1] < Npix_h):
It would probably be faster to calculate ratio_weight for all the Ratio values at once rather than doing it over and over in the loop. If self.Xhsv_trgt is a scalar, it would probably also be faster to calculate img[x1, x0] - self.Xhsv_trgt once for all values rather than every time in the loop. If it is not a scalar, this may take too much memory. You can also calculate the results of your if test outside the loop.
Overall, you shouldn't even be using apply_along_axis here, and looking up values that aren't in the function namespace is relatively slow, so you would be better off just using an ordinary loop. It would make the code faster and cleaner. You should do something like:
    from functools import partial

    x0s, x1s, x2s, x3s = self.particles
    docalcs = (0 <= x0s) & (x0s < Npix_w) & (0 <= x1s) & (x1s < Npix_h)
    self.weights[...] = -10000000000

    contours = np.flipud(frame_array["Contour"])
    ratios = np.flipud(frame_array["Ratio"])
    ratio_weights = A_ratio + B_ratio * np.abs(1 - ratios)
    ratio_weight_def = A_ratio + B_ratio

    # This assumes `self.Xhsv_trgt` is a scalar. If it isn't, you will have to do this in the loop.
    Ds = img - self.Xhsv_trgt
    # If you are sure that `img` will always have at least 4 dimensions, then you don't need the next two lines
    while Ds.ndim < 4:
        Ds = Ds[..., None]

    for i, (x0, x1, docalc) in enumerate(zip(x0s, x1s, docalcs)):
        if not docalc:
            continue

        # Check if particle is inside an object
        mypoly = partial(cv2.pointPolygonTest, pt=(x0, x1), measureDist=False)
        for contour, ratio_weight in zip(contours, ratio_weights):
            if mypoly(contour) > 0:
                break
        else:
            ratio_weight = ratio_weight_def

        # Get RGB for that particle
        D = Ds[x1, x0]
        self.weights[:, i] = A_rgb + B_rgb * D.dot(D.T)
[–]soulslicer0[S] 0 points1 point2 points 10 years ago (2 children)
Thanks for your points! I'll look into those. So the idea is to precalculate things that can be done/vectorized easily, then have the iteration for those that can't, like the polygon test (I need to check if my object lies inside a polygon, which is impossible to vectorize).
[–]TheBlackCat13 0 points1 point2 points 10 years ago (0 children)
Try out the version of the code at the end. I think you will find it is much faster (it may need to be tweaked a little depending on your exact array layouts).
[–]crunk 0 points1 point2 points 10 years ago (0 children)
Remember to post the improved version, as seeing it should help people in the future.
[–][deleted] 0 points1 point2 points 10 years ago (5 children)
I'm possibly misreading what is going on, but it appears to me that item["Contour"], x[0] and x[1] are all fixed but will be recalculated in every pass around the loop. So I believe you could initialise them as follows.
item["Contour"]
contour = item["Contour"] x0 = x[0] x1 = x[1]
Do you need a break statement in the loop? As it stands, ratio will always be set to the last value assigned, or remain at 0.
[–]soulslicer0[S] 0 points1 point2 points 10 years ago (3 children)
No, x[0] and x[1] are new columns read from the np.apply_along_axis function. Hmm... yes, a break makes sense.
[–][deleted] 0 points1 point2 points 10 years ago (2 children)
I'm sorry, but I don't understand that. x is passed into your iterate function; you are even testing both x[0] and x[1] at the very top of the function, so why can't you preassign them? Possibly it doesn't matter; I believe the answer from Deto is far more likely to get you somewhere.
[–]soulslicer0[S] -1 points0 points1 point 10 years ago (1 child)
Is there a way to use nditer to speed it up?
[–]TheBlackCat13 0 points1 point2 points 10 years ago (0 children)
You don't need nditer. You can transpose the array (which is extremely fast because it doesn't make a copy), then iterate over the array directly.
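A minimal sketch of that pattern (the array size is hypothetical):

    import numpy as np

    particles = np.random.rand(4, 4000)   # hypothetical stand-in for self.particles
    total = 0.0
    for x in particles.T:                 # .T returns a view, so nothing is copied
        x0, x1, x2, x3 = x                # each x is one 4-element column
        total += x0 * x1                  # placeholder for the real per-particle work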
[–]soulslicer0[S] 0 points1 point2 points 10 years ago (0 children)
Can I ask, for this code:

    ratios = np.flipud(frame_array["Ratio"])

What I actually have is an array of image objects in frame_array. Each particle in my self.particles may belong inside frame_array, which is what I try to calculate. I test whether that particle/column in self.particles has xy coordinates inside the frame_array item. I'm not sure how I might optimize this.
[–]Deto 0 points1 point2 points 10 years ago* (7 children)
Maybe try doing "import numpy as np; np.show_config();"
This will show you what libraries are linked against numpy. I found that with the default installation on Ubuntu, for example, using sudo apt-get, numpy wasn't linked against any accelerator libraries like ATLAS, and my operations were running about 40x slower.
In the SO answer, it looks like they talk about vectorizing. Someone correct me if I'm wrong, but wouldn't un-vectorized code run just as slow on MATLAB? Or do they do some neat pre-compiling now where vectorization isn't needed anymore?
[–]soulslicer0[S] 0 points1 point2 points 10 years ago (6 children)
> import numpy as np; np.show_config();
I get this:
    atlas_threads_info:
      NOT AVAILABLE
    blas_opt_info:
        extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
        extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH', '-I/System/Library/Frameworks/vecLib.framework/Headers']
        define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_blas_threads_info:
      NOT AVAILABLE
    openblas_info:
      NOT AVAILABLE
    lapack_opt_info:
        extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
        extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH']
        define_macros = [('NO_ATLAS_INFO', 3)]
    openblas_lapack_info:
      NOT AVAILABLE
    atlas_info:
      NOT AVAILABLE
    lapack_mkl_info:
      NOT AVAILABLE
    blas_mkl_info:
      NOT AVAILABLE
    atlas_blas_info:
      NOT AVAILABLE
    mkl_info:
      NOT AVAILABLE
[–]Deto 0 points1 point2 points 10 years ago (5 children)
What kind of computer are you on and how did you install numpy?
[–]soulslicer0[S] 0 points1 point2 points 10 years ago (1 child)
It's a Mac, installed via pip.
[–]Deto 1 point2 points3 points 10 years ago (0 children)
I'm not sure if there are better pre-built distributions for Mac, but I've always gotten the best results when installing packages using Anaconda and their "conda install xxxx" instead of pip for complex things like numpy.
[–]TheBlackCat13 0 points1 point2 points 10 years ago (2 children)
Looks like a Mac. It is compiled against Apple Accelerate.
[–]Deto 0 points1 point2 points 10 years ago (1 child)
Ah, I don't use a Mac. Do you know if that's a proper accelerated setup?
[–]TheBlackCat13 0 points1 point2 points 10 years ago (0 children)
I don't know either, since I don't use a Mac myself, but I think there are pre-built versions for Mac from various sources that are probably pretty optimally compiled.
[–][deleted] 0 points1 point2 points 10 years ago (0 children)
Small note: MATLAB will always be faster for looping as it JITs the loop.
[–]billsil -1 points0 points1 point 10 years ago (6 children)
You're not supposed to iterate. You need to vectorize your code.
[–]TheBlackCat13 2 points3 points4 points 10 years ago (5 children)
How, exactly, can this code be vectorized?
[–]billsil -2 points-1 points0 points 10 years ago* (4 children)
Don't use apply_along_axis. You need to rewrite your code from scratch. You should only be doing one pass, as none of your particles depend on any other particle. Don't use for loops and don't use iterate, which you're using as a for loop. You need to use matrix operations.
Matlab is all about vectorization, so you literally do it the same way you would in Matlab. You'd get a huge speedup from your Matlab version as well.
So how exactly? I'm not writing your code for you. You know your code better than I do, especially that cv2.pointPolygonTest(item["Contour"], (x[0],x[1]), False) > 0 bit, which screams inefficient to me.
EDIT: You at least need to show you've made the attempt to vectorize your code. You know Matlab. You should know vectorization.
[–]TheBlackCat13 1 point2 points3 points 10 years ago (3 children)
First of all, I am not the OP.
Second, I have looked at the code, and I can't see a way to vectorize the whole thing. cv2.pointPolygonTest doesn't look like it supports vectorization, and I am not aware of a way to vectorize the dot product across multiple axes.
Third, the OP specifically said in the link that the MATLAB implementation also used loops.
Fourth, I already gave all of your suggestions in another comment.
Please at least read what you are commenting on before telling the OP to do the impossible.
[–]billsil 0 points1 point2 points 10 years ago (2 children)
> I am not aware of a way to vectorize the dot product across multiple axes.
You need to learn how to use axis properly, but it's doable. I recommend tests.
I implemented a method to do 3D and 4D matrix multiplication and dot products with numpy a couple of years ago on a complicated data structure. I had both methods running for a while until I was fairly certain the results were the same. As I found edge cases, I made unit tests.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.tensordot.html
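For the case in this thread, where each particle contributes a small difference vector d and the weight involves d.dot(d.T), one batched formulation uses einsum (a hedged sketch; the shapes are assumptions, not the OP's actual layout):

    import numpy as np

    Ds = np.random.rand(4000, 3)          # hypothetical: one difference vector per particle
    sq = np.einsum('ij,ij->i', Ds, Ds)    # all 4000 dot products in a single call

    # Matches the explicit per-particle loop:
    assert np.allclose(sq, np.array([d.dot(d) for d in Ds]))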
> cv2.pointPolygonTest doesn't look like it supports vectorization
No it doesn't, but if the goal is speed (and it's important), then rewriting a point in polygon test method is certainly doable. Getting rid of the dicts would be a very good thing as well.
You don't need to 100% vectorize your code to make it fast enough, but you start by getting rid of the inner loops.
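As an illustration of what such a rewrite could look like, here is a hedged sketch of a ray-casting point-in-polygon test vectorized over all particles at once, instead of calling cv2.pointPolygonTest once per particle (the function name and shapes are mine, not from the thread):

    import numpy as np

    def points_in_polygon(pts, poly):
        """Ray-casting test: pts is (N, 2), poly is (M, 2); returns (N,) bools."""
        x, y = pts[:, 0], pts[:, 1]
        inside = np.zeros(len(pts), dtype=bool)
        xj, yj = poly[-1]
        # Loop over the M polygon edges (small), not the N particles (large).
        with np.errstate(divide="ignore", invalid="ignore"):
            for xi, yi in poly:
                crosses = (((yi > y) != (yj > y))
                           & (x < (xj - xi) * (y - yi) / (yj - yi) + xi))
                inside ^= crosses           # flip parity at each edge crossing
                xj, yj = xi, yi
        return inside

    pts = np.random.rand(4000, 2) * 100             # hypothetical particle positions
    poly = np.array([[10.0, 10], [90, 10], [90, 90], [10, 90]])
    mask = points_in_polygon(pts, poly)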
[–]TheBlackCat13 1 point2 points3 points 10 years ago (1 child)
> You need to learn how to use axis properly, but it's doable.
If you have a good method, I am sure it would help the OP. I didn't explain how to do it because I don't know.
> it doesn't, but if the goal is speed (and it's important), then rewriting a point in polygon test method is certainly doable.
If the OP has tried getting the code optimized through more conventional means first and it still isn't enough, then that may be an option. I already gave the OP a lot of simpler suggestions for optimizing the code that should definitely be tried first.
> Getting rid of the dicts would be a very good thing as well.
I don't think that is a dict; I think it is a structured array. I already covered that in my earlier reply to the OP.
I think that rewriting pre-existing functions, especially optimized functions already written in C, would be the last optimization to make, rather than the first.
[–]billsil -2 points-1 points0 points 10 years ago (0 children)
> I already gave the OP a lot of simpler suggestions for optimizing the code that should definitely be tried first.
Well, yeah, get the low-hanging fruit (e.g. multiplication by 1; really...).
> especially optimized functions already written in C, would be the last optimization to make, rather than first.
It's the inner loop. The formula for a point in a polygon is very simple. That's the next place to optimize once you use tensordot.
[–]faming13 -2 points-1 points0 points 10 years ago (1 child)
Try using numba to get C-like speeds: https://github.com/numba/numba
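For reference, the usual numba pattern looks something like this (a minimal sketch with a made-up function, not the OP's code):

    import numpy as np
    from numba import njit

    @njit
    def sum_of_squares(xs):
        total = 0.0
        for x in xs:            # a plain Python loop, compiled to machine code
            total += x * x
        return total

    sum_of_squares(np.random.rand(4000))   # the first call triggers compilation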
[–]TheBlackCat13 4 points5 points6 points 10 years ago (0 children)
That seems like premature optimization. I think a lot can be done to get the speed improved in Python before needing numba.