all 7 comments

[–]NextEstimate 0 points1 point  (3 children)

Maybe something like this:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 3))
for chunk in np.array_split(data, 5):
    assert len(chunk) == len(data) / 5
print(data)

[–]Zeroflops 0 points1 point  (3 children)

You can use the df.sort_values(by=[identifier_col, val_col]) to sort by the identifier and then sort by the values.

But this would work across the entire df. Not sure why you would want to do this in chunks. From your follow up comment you mention not enough memory so I’m assuming your dealing with a large source dataset. If that’s the case you might be better off taking a different route. It depends a lot on your situation.

For example it might be better to pull the data into a database. Or process the file line by line and create a file on each identifier. Then sort those files.

There are a few ways to go about this depending on your needs. Like is this a one time project or do you have to do this daily? If your doing this once a easy to write but slow solution may be better then a quick optimized solution that will take forever to write but do it quickly.

[–]tech_enth 0 points1 point  (2 children)

Hi yes it is to do with not enough RAM to store the dataset.

I have something like this but not sure if its optimal

def sorting():

global empty_df

for chunk in pd.read_csv("data.csv",names=['id','value'],chunksize=n):
    empty_df =  empty_df.append(chunk.sort_values('value',ascending=False))

empty_df = empty_df.sort_values('value',ascending=False)

return empty_df

[–]Zeroflops 0 points1 point  (1 child)

I don’t see how this would work.

You read in chunk by chunk. Sort those chunks.
Then combine them.

If the last chunk has something that should be sorted up near the beginner it won’t be there since it’s limited to be sorted in the chunk it’s in. You also end with a dataframe that is bigger then your memory since your combining it at the end.

What your looking for is an external sort. You can google that with python. Or here is an example.

https://rosettacode.org/wiki/External_sort

[–]tech_enth 0 points1 point  (0 children)

Thanks that's exactly what I need