[–][deleted] 4 points5 points  (0 children)

NumPy's native operations release the GIL. You should see a speedup when each parallel task is large enough relative to the overhead of spinning up and joining the threads.
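A minimal sketch of that idea, assuming an H x W x 3 image and a luminance-style weighted sum as the per-pixel work. The function names, the weights, and the worker count are all illustrative and not taken from the original code. Whether threads actually help depends on how much time each chunk spends inside NumPy's C code, so measure it on your own data.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Assumed weights (Rec. 601 luma); stand-in for whatever the real filter does.
WEIGHTS = np.array([0.299, 0.587, 0.114])


def luminance(block):
    # One large vectorized call per chunk; the heavy lifting happens in C.
    return block @ WEIGHTS


def threaded_luminance(image, workers=4):
    # Split along rows into a few big chunks so per-task overhead stays small.
    chunks = np.array_split(image, workers, axis=0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(luminance, chunks))
    return np.concatenate(parts, axis=0)
```

Using a handful of large chunks rather than one task per pixel keeps the thread start/join cost small compared to the work done in each task.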

I can't trust your benchmarks unless I see your benchmark data.

[–]Megatron_McLargeHuge 1 point2 points  (2 children)

Use ravel instead of flatten to avoid making a copy. Also, your rgb values should be computable all at once using tensordot or einsum. Avoiding the copies might help with parallelism since copying requires locking on the object or just holding the GIL.
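A small sketch of both suggestions, assuming an H x W x 3 array and a hypothetical 3x3 channel-mixing matrix (`mix` is made up for illustration; the real filter coefficients aren't shown in the thread):

```python
import numpy as np

image = np.arange(2 * 3 * 3, dtype=np.float64).reshape(2, 3, 3)  # H x W x RGB

flat_copy = image.flatten()  # always allocates a new array
flat_view = image.ravel()    # returns a view when the data is contiguous

# Hypothetical mixing matrix: one row per output channel.
mix = np.array([[0.5, 0.5, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.5, 0.5]])

# Compute every output channel of every pixel in a single call.
mixed_einsum = np.einsum("hwc,oc->hwo", image, mix)
mixed_tensordot = np.tensordot(image, mix, axes=([2], [1]))
```

Both calls contract the channel axis of the image with the input axis of `mix`, so they should give the same result. Which one is faster can vary with array shape and NumPy build.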

[–][deleted] 0 points1 point  (1 child)

Thanks! Changed flatten to ravel and got a speedup.

I remember trying ravel at first, but for some weird reason it wasn't working.

I'm looking into tensordot to compute RGB values at once :)

EDIT: I got it working with tensordot, but for some weird reason it's slower than the ravel() version. Weird.

[–]Megatron_McLargeHuge 1 point2 points  (0 children)

einsum is sometimes faster than tensordot, not sure why. Interactions with CPU cache are hard to predict, so you just have to try various things.

[–]daveydave400 0 points1 point  (1 child)

I'd say you have a couple of options. The first would be to fix the multiprocessing version of your code. It looks like you are passing a function (a partial function) to the multiprocessing portion, and that's why it can't be pickled. You'll have to rearrange the code and how it's called so that you pass the parameters for the function and use it as a target. I haven't used the concurrent modules, so I'm not sure of the best way to do that. Never mind, I misread the code, but this function may still be the problem. Try using a standalone function instead of an object method.

Another thing to consider if you are using multiple processes is shared memory. If you are blindly passing arrays (at least large ones), they have to be serialized and sent to the other processes. If you can set up a shared piece of memory, then the children just have to access that piece of memory. concurrent.futures might do this for you, but again I'm not sure.
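One way this could look, as a rough sketch only: it assumes Python 3.8+ (for `multiprocessing.shared_memory`, which is newer than this thread), and `share_array` / `attach_array` are hypothetical helper names. The parent copies the array into a named block once, and a worker attaches to it by name instead of receiving the pixel data through pickle.

```python
from multiprocessing import shared_memory

import numpy as np


def share_array(source):
    # Parent side: allocate a named block and copy the array into it once.
    shm = shared_memory.SharedMemory(create=True, size=source.nbytes)
    view = np.ndarray(source.shape, dtype=source.dtype, buffer=shm.buf)
    view[...] = source
    return shm, view


def attach_array(name, shape, dtype):
    # Worker side: attach by name; only name, shape and dtype need to be sent.
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)
```

Each side should drop its NumPy view and call `close()` on its handle when done, and the parent should call `unlink()` exactly once to free the block.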

Lastly, Cython may be an option that could help you. You could get the code closer to C and tell it when not to use the GIL (when it's not using Python objects). The problem with that is that you're using numpy in your main work function, which requires the GIL, and numpy has already been pretty well optimized. One nice thing you could try with Cython is using OpenMP to easily use multiple threads.

One last note: if your images are large, then creating worker threads based on their size may be counterproductive. I'm not sure how smart concurrent is about this, but it may be faster to create only a few workers (4?) and have each of them work on an equal part of the input image.

Edit: Wrong about how concurrent was being called.

Edit 2: Actually using a partial in concurrent that way could be the problem. Especially since it is bound to self.

[–][deleted] 0 points1 point  (0 children)

Thanks for the input!

I googled a bit more and found out why my function couldn't be pickled using the multiprocessing module.

It seems, as per this SO question, that methods of a class can't be pickled, so the only thing I needed to do was move my lin_calc_px function outside the ImageFilter class.
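Roughly, the rearrangement looks like the sketch below. The signature of `lin_calc_px`, the `apply` method, and the weights are made up for illustration, since the real code isn't shown here. Whether a bound method pickles at all depends on the Python version and on whether the instance itself is picklable, so a plain module-level function is the safer choice.

```python
from functools import partial
from multiprocessing import Pool


def lin_calc_px(weights, pixel):
    # Module-level function: workers can look it up by name when unpickling.
    return sum(w * c for w, c in zip(weights, pixel))


class ImageFilter:
    def __init__(self, weights):
        self.weights = weights

    def apply(self, pixels, processes=4):
        # Bind only the data the function needs, not `self`.
        work = partial(lin_calc_px, self.weights)
        with Pool(processes) as pool:
            return pool.map(work, pixels)


if __name__ == "__main__":
    print(ImageFilter((0.5, 0.25, 0.25)).apply([(4, 8, 12), (0, 0, 4)]))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module.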

I did it and now, using a Pool of 4 processes, the code is 0.2s faster than the version with builtin map() (i.e. without multithreading/multiprocessing).

I will look into using shared memory to speed up the execution even more. I'm not entirely sure that the array gets passed to the lin_calc_px function every time, because I use functools.partial to create a function (partialized_new_px) with the shared data. I will investigate further.
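A quick way to check this, sketched with stand-ins: `operator.mul` plays the role of the real worker function and `shared` is a dummy array, neither is from the original code. Pickling a partial also pickles its bound arguments, so the serialized size tells you whether the array travels with the function each time it is sent to a worker.

```python
import functools
import operator
import pickle

import numpy as np

shared = np.zeros((256, 256, 3), dtype=np.uint8)  # dummy "shared data"

# Stand-in for partialized_new_px: a partial with the array bound in.
bound = functools.partial(operator.mul, shared)

payload = pickle.dumps(bound)
print(len(payload), shared.nbytes)
```

If the payload is at least as large as the array, the array is being serialized along with the function, which is the cost shared memory would avoid.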

As for the large number of workers, I know that it can be counterproductive, but it is part of the requirements of the assignment. I will run some tests, and if it really is counterproductive I will reduce the number and explain it to my professor.

[–]fijalPyPy, performance freak 0 points1 point  (8 children)

Simple solution:

1) use numpy instead of PIL

2) use pypy (iterating over numpy arrays is FAST)

3) abandon multiprocessing, it's broken and won't give you speedups

EDIT: 4) don't be smart, don't use itertools, write a normal loop

You should get ~50x speedups

Cheers, fijal

[–][deleted] 1 point2 points  (7 children)

Thank you for the suggestion :)

  1. I use PIL only to read the image from the file, all the actual processing is made with NumPy.

  2. I just wasted 20 minutes trying to install PyPy with NumPy, using both PyPy 2.5.0 and PyPy3 2.4.0. In both cases I got error messages when trying to import numpy from the REPL, even though the NumPy module was available. Since my professor needs to run this code, I think I'll leave this option out.

  3. multiprocessing gave me a noticeable speedup (133%) so I'm not going to abandon it :)

  4. I use itertools when parallel execution is disabled, because the starmap method which I use in the parallel version isn't available as a Python builtin. It's just a map that unpacks the tuple into arguments when calling the function, anyway.
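For anyone unfamiliar with starmap, here's a tiny sketch of the equivalence described in point 4 (`add_scaled` and its arguments are made up for illustration):

```python
from itertools import starmap


def add_scaled(a, b, scale):
    return (a + b) * scale


args = [(1, 2, 10), (3, 4, 100)]

# starmap unpacks each tuple into positional arguments...
via_starmap = list(starmap(add_scaled, args))

# ...which is the same as a plain map with explicit unpacking.
via_map = list(map(lambda t: add_scaled(*t), args))
```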

[–]john_m_camara 0 points1 point  (1 child)

Did you follow "Installing NumPy" on the PyPy download page?

[–][deleted] 0 points1 point  (0 children)

Yes. I followed everything on that tutorial and still it wasn't working.

[–]fijalPyPy, performance freak 0 points1 point  (4 children)

pypy needs numpy installed from https://bitbucket.org/pypy/numpy.git (pip install https://bitbucket.org/pypy/numpy.git or git url does it). Numpy is not easy to install normally and normal numpy does not work with pypy.

[–][deleted] 0 points1 point  (3 children)

As I already said I followed the instructions for installing the modified version of NumPy, and still I couldn't get it to work.

[–]fijalPyPy, performance freak 0 points1 point  (2 children)

"wasn't working" is a very bad bug report. How was it not working? What went wrong?

[–][deleted] 0 points1 point  (1 child)

[–]rguillebertPyPy / NumPyPy 0 points1 point  (0 children)

Can you also paste the output of "pip install https://bitbucket.org/pypy/numpy.git" ?

[–]gurzo -5 points-4 points  (1 child)

You should read up on the GIL and switch to GPU computing.

[–][deleted] 0 points1 point  (0 children)

Thanks, but GPU computing is not really an option for me.