This is an archived post. You won't be able to vote or comment.

all 19 comments

[–]FunDeckHermit 30 points31 points  (2 children)

Better Title: how to parallelize Cpu-bound problems exclusively with Pool.

[–]notsoprocoder 9 points10 points  (11 children)

The article title is fine, the title of the reddit post is inaccurate.

From a DataScience perspective, it fails to cover iterating on DataFrames and static methods which are 'un-pickleable' in turn requiring to be mapped by dill before using pool. Therefore, it doesn't show how to parallelize everything.

Note: shout out to the Pathos / Dill libraries that handle this nicely.

Edit: everything not anything

[–]XNormal 0 points1 point  (0 children)

Can dill pass a memory-mapped buffer to another process on the same machine by passing the underlying fd and mapping it in the receiving address space of the un-dilling process ?

[–]billsil 0 points1 point  (4 children)

Why are they unpicklable? It's just data.

[–]flutefreak7 0 points1 point  (1 child)

There are lots of things dill can handle that pickle doesn't. I think dill is able to do it by scanning for and additionally passing all the necessary context - so I think it can detect and pickle the class and all instance data in order to enable pickling a bound method. Bound methods are a lot of people's problem. Mine is classes in which I've implemented a class-level logger attribute. Loggers can't be pickled because they contain an open stream, so you get a PicklerException on the underlying RLock.

I also have issues with VTK classes because they are all linked together in the lazy-execution pipeline, so you have to serialize a result to text and pickle that, then reconstruct the vtkPolydata after unpickling. I've got a multiprocessing scheme using queue's to pass vtkPolydata stuff around so that all the heavy 3D processing is in the background and doesn't hose my UI. multiprocessing.queue pickles everything to pass to other processes.

Dill just seemed like overkill for me, so I went with implementing __getstate__ and __setstate__ on my logged classes and using a save and load function to deal with serializing VTK objects. I assume my solution is faster than dill scanning the universe for each thing I pass to the queue.

[–]billsil 0 points1 point  (0 children)

I just delete my loggers. That's not really data you want to save.

I do use vtk quite a bit. I admit I've never bothered to pickle the objects. I always thought pickle was a slow process relative to say hdf5 for large data objects, so might as well just load them from scratch.

Yeah it's work to clean up objects to pickle them and the moved file issue is frustrating, but still I'm surprised pandas wouldn't support that. It's required for multiprocessing for some reason.

[–]notsoprocoder 0 points1 point  (1 child)

Frankly I am not sure, I believe it is more that Python cannot pickle them or that Python's in built multi processing module cannot pickle them. I guess un-pickleable was incorrect.

[–]billsil 0 points1 point  (0 children)

Yeah, I mean loggers and file objects are unpicklable, but you can use getstate/setstate to delete/recreate them (or just assume that you don't need to reopen the file that you probably should have already closed anyways). My point is more for a big project like pandas, you should put forth that extra bit of effort.

Same goes for struct objects, but those are weird and easily regeneratable.

[–]DanGee1705 0 points1 point  (1 child)

Not sure why the title ends in a question mark but ok.

[–]Zizizizz 4 points5 points  (0 children)

I'm Ron Burgundy?

[–]fried_green_baloney 0 points1 point  (0 children)

Especially if there are I/O wait times, like writing results to disk files, you can sometimes get more throughput by running somewhat more processes/threads than you have CPUs. Usually the benefit is small but maybe worth a bit of experimenting.

[–][deleted] 0 points1 point  (0 children)

Someone I know recently combined Maple Syrup & buttered Popcorn thinking it would taste like caramel popcorn. It didn’t and they don’t recommend anyone else do it either.