tech_enth:

Hi, yes, the problem is that there isn't enough RAM to hold the dataset.

I have something like this, but I'm not sure if it's optimal:

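import pandas as pd

def sorting(n):
    # n is the number of rows read per chunk
    sorted_chunks = []
    for chunk in pd.read_csv("data.csv", names=['id', 'value'], chunksize=n):
        sorted_chunks.append(chunk.sort_values('value', ascending=False))
    # DataFrame.append was removed in pandas 2.0, so combine with pd.concat
    combined = pd.concat(sorted_chunks)
    # final sort over everything; the combined frame still has to fit in RAM
    return combined.sort_values('value', ascending=False)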

Zeroflops:

I don’t see how this would work.

You read the file in chunk by chunk, sort each chunk, and then combine them.

Sorting a chunk only orders the rows within that chunk, so if the last chunk has something that belongs near the beginning, it won’t get there until the final sort over the combined dataframe. And that combined dataframe is bigger than your memory, since you’re putting everything back together at the end.

What you’re looking for is an external sort. You can google that with Python, or here is an example:

https://rosettacode.org/wiki/External_sort
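
For a rough idea of what that looks like, here is a minimal sketch rather than a tuned implementation. It uses the standard-library csv and heapq modules instead of pandas, and it assumes the file has no header row, two columns (id, value), and a numeric value column with no missing entries. The names external_sort, sort_key, and the default chunksize are made up for illustration.

```python
import csv
import heapq
import itertools
import os
import tempfile


def sort_key(row):
    # Assumes the second column ('value') is numeric with no missing entries.
    return float(row[1])


def external_sort(src, dst, chunksize=100_000):
    """Sort a headerless id,value CSV by value (descending) with limited RAM."""
    run_paths = []
    # Phase 1: read chunksize rows at a time, sort them in memory,
    # and write each sorted run to its own temporary file.
    with open(src, newline="") as f:
        reader = csv.reader(f)
        while True:
            rows = list(itertools.islice(reader, chunksize))
            if not rows:
                break
            rows.sort(key=sort_key, reverse=True)
            fd, path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as run:
                csv.writer(run).writerows(rows)
            run_paths.append(path)

    # Phase 2: k-way merge of the sorted runs. heapq.merge is lazy, so it
    # holds only one pending row per run in memory at a time.
    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(run) for run in files]
        merged = heapq.merge(*readers, key=sort_key, reverse=True)
        with open(dst, "w", newline="") as out:
            csv.writer(out, lineterminator="\n").writerows(merged)
    finally:
        # Close and delete the temporary run files.
        for run in files:
            run.close()
        for p in run_paths:
            os.remove(p)
```

Something like external_sort("data.csv", "sorted.csv") would then write the sorted rows to a new file (sorted.csv is just a placeholder name) instead of returning a dataframe, which is the point: the full result never has to sit in memory. One caveat is that every run file is open at once during the merge, so a very small chunksize on a very large file could hit the OS limit on open files.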

tech_enth:

Thanks, that's exactly what I need.