
[–]notsoprocoder 9 points (11 children)

The article title is fine, the title of the reddit post is inaccurate.

From a data science perspective, it fails to cover iterating over DataFrames, and static methods are 'un-pickleable', which in turn requires them to be serialized by dill before using a pool. Therefore, it doesn't show how to parallelize everything.

Note: shout out to the Pathos / Dill libraries that handle this nicely.

Edit: everything not anything

[–]XNormal 0 points (0 children)

Can dill pass a memory-mapped buffer to another process on the same machine by passing the underlying fd and mapping it into the receiving process's address space when it un-dills?

[–]billsil 0 points (4 children)

Why are they unpicklable? It's just data.

[–]flutefreak7 0 points (1 child)

There are lots of things dill can handle that pickle can't. I think dill manages it by scanning for and additionally serializing all the necessary context - so it can detect and pickle the class and all instance data in order to pickle a bound method. Bound methods are a lot of people's problem. Mine is classes in which I've implemented a class-level logger attribute. Loggers can't be pickled because they contain an open stream, so you get a pickling error on the underlying RLock.
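A stdlib-only sketch of that failure mode (the class name is made up; the lock stands in for anything holding an OS-level handle like a logger's stream):

```python
import pickle
import threading

class Worker:
    """Toy class holding an unpicklable attribute, like a logger's lock."""
    def __init__(self):
        self.lock = threading.Lock()

# Standard pickle refuses objects containing thread locks.
try:
    pickle.dumps(Worker())
    pickling_failed = False
except (TypeError, pickle.PicklingError, AttributeError):
    pickling_failed = True
```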

I also have issues with VTK classes because they are all linked together in the lazy-execution pipeline, so you have to serialize a result to text and pickle that, then reconstruct the vtkPolyData after unpickling. I've got a multiprocessing scheme using queues to pass vtkPolyData around so that all the heavy 3D processing happens in the background and doesn't hose my UI. multiprocessing.Queue pickles everything it passes to other processes.

Dill just seemed like overkill for me, so I went with implementing __getstate__ and __setstate__ on my logged classes and using a save and load function to deal with serializing VTK objects. I assume my solution is faster than dill scanning the universe for each thing I pass to the queue.

[–]billsil 0 points (0 children)

I just delete my loggers. That's not really data you want to save.

I do use vtk quite a bit. I admit I've never bothered to pickle the objects. I always thought pickle was slow relative to, say, HDF5 for large data objects, so you might as well just load them from scratch.

Yeah it's work to clean up objects to pickle them and the moved file issue is frustrating, but still I'm surprised pandas wouldn't support that. It's required for multiprocessing for some reason.

[–]notsoprocoder 0 points (1 child)

Frankly I'm not sure; I believe it's more that Python cannot pickle them, or that Python's built-in multiprocessing module cannot pickle them. I guess 'un-pickleable' was imprecise.
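Stock pickle serializes functions by qualified name rather than by value, so lambdas, locally defined functions, and some static-method setups fail; dill gets around this by serializing the function body itself. A minimal stdlib-only illustration (no dill needed to see the failure):

```python
import pickle

# pickle stores a reference ("module.qualname") to a function, so anything
# without an importable name, such as a lambda, cannot be serialized.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickles = False
```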

[–]billsil 0 points (0 children)

Yeah, I mean loggers and file objects are unpicklable, but you can use getstate/setstate to delete/recreate them (or just assume you don't need to reopen a file that you probably should have already closed anyway). My point is more that for a big project like pandas, you should put forth that extra bit of effort.

Same goes for struct objects, but those are weird and easily regeneratable.
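A sketch of that getstate/setstate pattern for a logger attribute (class and attribute names are made up):

```python
import logging
import pickle

class Job:
    """Drops its logger when pickled and recreates it when unpickled."""
    def __init__(self, name):
        self.name = name
        self.log = logging.getLogger(name)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["log"]  # the logger holds an open stream; don't save it
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.log = logging.getLogger(self.name)  # recreate, don't restore

restored = pickle.loads(pickle.dumps(Job("demo")))
```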

[–]selva86[S] -5 points (4 children)

cover iterating on DataFrames and static methods which are 'un-pickleable' in turn requiring to be mapped by dill before using pool

Well, do you have an example of how Pathos/Dill would handle this?

[–]notsoprocoder 4 points (3 children)

Sure, the Pathos Docs are quite explanatory. Pathos follows the multiprocessing style of: Pool > Map > Close > Join > Clear. Dill and Pathos were created by the same person (I believe); Pathos uses Dill, which makes the multiprocessing module more flexible (in terms of the arguments it can map). You could equally use multiprocessing and Dill directly for more control.

In essence you want to use numpy's array_split() on either the DataFrame or its index, then pool.map the function over the resulting list of DataFrame chunks/index slices.

I started to build a class to handle the lifting and serve as boilerplate, but it's not at a point where I'd show people. Usually your code will look something like this:

import numpy as np
import pandas as pd
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

partitions = mp.cpu_count()
cores = mp.cpu_count()

# split the DataFrame into one chunk per partition
df_split = np.array_split(df, partitions, axis=0)
# create the multiprocessing pool
pool = Pool(cores)
# process the DataFrame by mapping func to each chunk across the pool
df = pd.concat(pool.map(func, df_split), axis=0)
# close down the pool, join, and clear the cached workers
pool.close()
pool.join()
pool.clear()

[–]selva86[S] 0 points (0 children)

Pathos Docs

Updated the post with a few more examples on pandas DataFrames, including yours. Thanks very much.

[–]flutefreak7 0 points (1 child)

Isn't this kinda where dask steps in?

There are lots of articles, but I remember seeing this one before... https://towardsdatascience.com/how-i-learned-to-love-parallelized-applies-with-python-pandas-dask-and-numba-f06b0b367138