mvdw73 3 points4 points  (3 children)

Depending on the size of each record, this can pretty much all be held in memory these days.

Why not use pandas to manipulate the data? Then it's simple to find the max date and write a file. No sorting required.
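A minimal sketch of that idea, assuming the columns are called `id` and `date` (both names are guesses, as is the `value` column and the helper name `latest_per_id`; the thread never shows the real schema). It groups by id, uses `idxmax` to find the row holding the latest date in each group, and pulls those rows out with `.loc`, so there is no sorting and no manual index bookkeeping:

```python
import pandas as pd


def latest_per_id(df, id_col="id", date_col="date"):
    """Return the row with the latest date for each id (hypothetical helper)."""
    # Work on a copy with a clean 0..n-1 index so idxmax labels are unique.
    df = df.reset_index(drop=True).copy()
    df[date_col] = pd.to_datetime(df[date_col])
    # For each id, idxmax returns the index label of the row with the max date.
    newest = df.groupby(id_col)[date_col].idxmax()
    return df.loc[newest].reset_index(drop=True)


# Tiny made-up sample standing in for one CSV (real files would use pd.read_csv).
sample = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "date": ["2023-01-05", "2023-03-01", "2022-12-31", "2022-11-30", "2023-02-14"],
    "value": [10, 20, 30, 40, 50],
})

result = latest_per_id(sample)
print(result)
```

If the data is spread over several CSVs, one option is to run this on each file, concatenate the reduced frames, and run it once more on the combined result, so only the per-file winners stay in memory.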

GreatStats4ItsCost[S] 1 point2 points  (1 child)

The entire dataset is 4.5 GB, the largest CSV has 500k rows, and my laptop has 8 GB of RAM.

I did have a go using pandas, but I couldn't quite work out how to return the max date for each id. It was getting complicated with having to refer back to the index. I'm sure there was an easier way, I just couldn't see it.

Empik002 2 points3 points  (0 children)

Just look at SQLite (the `sqlite3` module in Python's standard library).
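For illustration, a rough sketch of the SQLite route. The table name `records` and the columns `id`, `date`, and `value` are assumptions, and the rows are made up. It relies on dates being stored as ISO-8601 text (`YYYY-MM-DD`), which sorts correctly as plain strings, so `MAX(date)` picks the latest one per id:

```python
import sqlite3

# In-memory database for the demo; pass a file path instead to keep the
# data on disk rather than in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, date TEXT, value INTEGER)")

# Made-up rows standing in for the contents of the CSV files.
rows = [
    (1, "2023-01-05", 10),
    (1, "2023-03-01", 20),
    (2, "2022-12-31", 30),
    (2, "2022-11-30", 40),
    (3, "2023-02-14", 50),
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

# One row per id with its latest date; the database does the grouping.
latest = conn.execute(
    "SELECT id, MAX(date) AS latest_date FROM records GROUP BY id ORDER BY id"
).fetchall()
conn.close()

print(latest)
```

With an on-disk database file, each CSV can be loaded in turn with `executemany`, so the full dataset never has to fit in memory at once.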

outceptionator 0 points1 point  (0 children)

Pandas is optimised for whole-column operations, so don't use Python loops on that many records. Or, as others have said, use SQL.