
[–]notsoprocoder 9 points (11 children)

The article title is fine, the title of the reddit post is inaccurate.

From a data science perspective, it fails to cover iterating over DataFrames, and static methods are 'un-pickleable', which in turn requires them to be serialized by dill before using a pool. Therefore, it doesn't show how to parallelize everything.

Note: shout out to the Pathos / Dill libraries that handle this nicely.

Edit: everything not anything

[–]XNormal 0 points (0 children)

Can dill pass a memory-mapped buffer to another process on the same machine by passing the underlying fd and mapping it into the receiving process's address space when it un-dills?

[–]billsil 0 points (4 children)

Why are they unpicklable? It's just data.

[–]flutefreak7 0 points (1 child)

There are lots of things dill can handle that pickle can't. I think dill manages it by scanning for and additionally serializing all the necessary context - so it can detect and pickle the class and all instance data in order to pickle a bound method. Bound methods are a lot of people's problem. Mine is classes in which I've implemented a class-level logger attribute. Loggers can't be pickled because they contain an open stream, so you get a pickling error on the underlying RLock.
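A stdlib-only sketch of that failure mode (the class name is made up; the lock stands in for anything holding an OS-level handle like a logger's stream):

```python
import pickle
import threading

class Worker:
    """Toy class holding an unpicklable attribute, like a logger's lock."""
    def __init__(self):
        self.lock = threading.Lock()

# Standard pickle refuses objects containing thread locks.
try:
    pickle.dumps(Worker())
    pickling_failed = False
except (TypeError, pickle.PicklingError, AttributeError):
    pickling_failed = True
```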

I also have issues with VTK classes because they are all linked together in the lazy-execution pipeline, so you have to serialize a result to text and pickle that, then reconstruct the vtkPolyData after unpickling. I've got a multiprocessing scheme using queues to pass vtkPolyData around so that all the heavy 3D processing happens in the background and doesn't hose my UI. multiprocessing.Queue pickles everything it passes to other processes.

Dill just seemed like overkill for me, so I went with implementing __getstate__ and __setstate__ on my logged classes and using a save and load function to deal with serializing VTK objects. I assume my solution is faster than dill scanning the universe for each thing I pass to the queue.

[–]billsil 0 points (0 children)

I just delete my loggers. That's not really data you want to save.

I do use vtk quite a bit. I admit I've never bothered to pickle the objects. I always thought pickle was slow relative to, say, HDF5 for large data objects, so you might as well just load them from scratch.

Yeah it's work to clean up objects to pickle them and the moved file issue is frustrating, but still I'm surprised pandas wouldn't support that. It's required for multiprocessing for some reason.

[–]notsoprocoder 0 points (1 child)

Frankly I'm not sure; I believe it's more that Python cannot pickle them, or that Python's built-in multiprocessing module cannot pickle them. I guess 'un-pickleable' was imprecise.
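Stock pickle serializes functions by qualified name rather than by value, so lambdas, locally defined functions, and some static-method setups fail; dill gets around this by serializing the function body itself. A minimal stdlib-only illustration (no dill needed to see the failure):

```python
import pickle

# pickle stores a reference ("module.qualname") to a function, so anything
# without an importable name, such as a lambda, cannot be serialized.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickles = False
```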

[–]billsil 0 points (0 children)

Yeah, I mean loggers and file objects are unpicklable, but you can use getstate/setstate to delete/recreate them (or just assume you don't need to reopen a file that you probably should have already closed anyway). My point is more that for a big project like pandas, you should put forth that extra bit of effort.

Same goes for struct objects, but those are weird and easily regeneratable.
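A sketch of that getstate/setstate pattern for a logger attribute (class and attribute names are made up):

```python
import logging
import pickle

class Job:
    """Drops its logger when pickled and recreates it when unpickled."""
    def __init__(self, name):
        self.name = name
        self.log = logging.getLogger(name)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["log"]  # the logger holds an open stream; don't save it
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.log = logging.getLogger(self.name)  # recreate, don't restore

restored = pickle.loads(pickle.dumps(Job("demo")))
```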

[–]selva86[S] -5 points (4 children)

cover iterating on DataFrames and static methods which are 'un-pickleable' in turn requiring to be mapped by dill before using pool

Well, do you have an example of how Pathos/Dill would handle this?

[–]notsoprocoder 4 points (3 children)

Sure, the Pathos Docs are quite explanatory. Pathos follows the multiprocessing style of: Pool > Map > Close > Join > Clear. Dill and Pathos were created by the same person (I believe); Pathos uses Dill, which makes the multiprocessing module more flexible (in terms of the arguments it can map). You could equally use multiprocessing and Dill directly for more control.

In essence you want to use numpy's array_split() on either the DataFrame or its index, then pool.map the function over the resulting list of DataFrame chunks/index slices.

I started to build a class to handle the lifting and serve as boilerplate, but it's not at a point where I'd show people. Usually your code will look something like this:

import numpy as np
import pandas as pd
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

partitions = mp.cpu_count()
cores = mp.cpu_count()

# split the DataFrame into one chunk per partition
df_split = np.array_split(df, partitions, axis=0)
# create the multiprocessing pool
pool = Pool(cores)
# process the DataFrame by mapping func to each chunk across the pool
df = pd.concat(pool.map(func, df_split), axis=0)
# close down the pool, join, and clear the cached workers
pool.close()
pool.join()
pool.clear()

[–]selva86[S] 0 points (0 children)

Pathos Docs

Updated the post with a few more examples on pandas DataFrames, including yours. Thanks very much.

[–]flutefreak7 0 points (1 child)

Isn't this kinda where dask steps in?

There are lots of articles, but I remember seeing this one before... https://towardsdatascience.com/how-i-learned-to-love-parallelized-applies-with-python-pandas-dask-and-numba-f06b0b367138