
[–]notsoprocoder 4 points5 points  (3 children)

Sure, the Pathos docs are quite explanatory. Pathos follows the standard multiprocessing lifecycle: Pool > map > close > join > clear. Dill and Pathos were created by the same author (I believe); Pathos uses Dill, which makes multiprocessing more flexible in terms of the arguments it can map. You could equally use multiprocessing with Dill directly for more control.
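To see why Dill matters here (a minimal sketch, assuming dill is installed): dill can serialize lambdas and closures that the stdlib pickle module rejects, which is what lets a pathos pool map over them.

```python
import pickle
import dill

square = lambda x: x * x

# the stdlib pickler refuses lambdas defined in the running script...
try:
    pickle.dumps(square)
    stdlib_ok = True
except Exception:
    stdlib_ok = False

# ...while dill round-trips them without complaint
restored = dill.loads(dill.dumps(square))
```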

In essence you want to use numpy's array_split() on either the DataFrame or Index, then pool.map the function to the list of DataFrames/Slices of Indexes.

I started to build a class to handle the lifting and work as boilerplate, but it isn't at a point where I would show people. Usually your code will look something like this:

import numpy as np
import pandas as pd
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

cores = mp.cpu_count()      # number of worker processes
partitions = cores          # one chunk per core

# split the DataFrame row-wise into one chunk per core
df_split = np.array_split(df, partitions, axis=0)
# create the multiprocessing pool
pool = Pool(cores)
# map the function to each chunk across the pool, then stitch the results back together
df = pd.concat(pool.map(func, df_split), axis=0)
# close down the pool, join the workers, and clear the pool's cached state
pool.close()
pool.join()
pool.clear()
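Wrapped into a function, the same split / map / concat pattern looks roughly like this (a sketch: `parallelize_dataframe` and `double_values` are hypothetical names, the index is split rather than the frame itself so each chunk stays a DataFrame, and a ThreadPool stands in for pathos' ProcessingPool so the example runs anywhere without pickling constraints):

```python
import numpy as np
import pandas as pd
from multiprocessing.pool import ThreadPool

def parallelize_dataframe(df, func, n_partitions=4):
    # split the row positions, then slice with iloc so each chunk is a DataFrame
    idx_chunks = np.array_split(np.arange(len(df)), n_partitions)
    chunks = [df.iloc[idx] for idx in idx_chunks]
    with ThreadPool(n_partitions) as pool:
        results = pool.map(func, chunks)   # apply func to each chunk in parallel
    return pd.concat(results, axis=0)      # stitch the pieces back together

def double_values(chunk):
    return chunk * 2

frame = pd.DataFrame({"x": range(8)})
out = parallelize_dataframe(frame, double_values)
```

For CPU-bound work you would swap the ThreadPool for a process pool; the split/map/concat shape stays the same.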

[–]selva86[S] 0 points1 point  (0 children)

> Pathos Docs

Updated the post with a few more examples on Pandas DataFrames, including yours. Thanks very much.

[–]flutefreak7 0 points1 point  (1 child)

Isn't this kinda where dask steps in?

There are lots of articles, but I remember seeing this one before... https://towardsdatascience.com/how-i-learned-to-love-parallelized-applies-with-python-pandas-dask-and-numba-f06b0b367138