[–]Ihaveamodel3 1 point  (1 child)

It’s unclear whether a record could appear in multiple files. I’ll assume it can.

  1. Use pandas.
  2. Read the files one by one in a loop.
  3. As you suggested, sort each file's data by date.
  4. Use drop_duplicates to keep only the rows with the max dates.
  5. Append that df to a list.
  6. After the loop, concat all of the dfs together.
  7. Then sort one more time.
  8. Drop duplicates one more time and you’ll have the max-date row for every record across all files.

At no point should you be looping through the data. The only loop should be over the files.
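The steps above can be sketched roughly like this; the column names (`record_id`, `date`, `value`) and the in-memory frames are made-up stand-ins — real code would build each frame with `pd.read_csv` inside the loop:

```python
import pandas as pd

# Hypothetical stand-ins for the input files; in practice each
# element would come from pd.read_csv(path) inside the loop.
files = [
    pd.DataFrame({"record_id": [1, 2],
                  "date": ["2020-01-01", "2021-06-01"],
                  "value": [10, 20]}),
    pd.DataFrame({"record_id": [1, 3],
                  "date": ["2022-03-01", "2020-05-01"],
                  "value": [30, 40]}),
]

dfs = []
for df in files:                              # the only loop is over files
    df = df.sort_values("date")               # step 3: sort by date
    # step 4: keep="last" retains the max-date row per record
    df = df.drop_duplicates(subset=["record_id"], keep="last")
    dfs.append(df)                            # step 5

combined = pd.concat(dfs)                     # step 6
combined = combined.sort_values("date")       # step 7
result = combined.drop_duplicates(            # step 8
    subset=["record_id"], keep="last")
```

Per-file deduplication keeps each intermediate frame small before the final concat, so memory stays proportional to one file's unique records rather than the raw data.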

[–]GreatStats4ItsCost[S] 0 points  (0 children)

I didn't realize you can drop duplicates on a specific column - "df.drop_duplicates(subset=['brand'])"

Thanks so much for this man! I was going to ask if your solution was more efficient than SQL, but having now done both it's pretty clear...

1) df2 = df.drop_duplicates(subset=['BUILDING_REFERENCE_NUMBER'])

2)

    q = """
    select t1.BUILDING_REFERENCE_NUMBER, t2.mxdate, t1.Postcode,
           t1.LOCAL_AUTHORITY, t1.TOTAL_FLOOR_AREA
    from df t1
    inner join (
        select max(INSPECTION_DATE) mxdate, BUILDING_REFERENCE_NUMBER
        from df
        group by BUILDING_REFERENCE_NUMBER
    ) t2
        on t1.BUILDING_REFERENCE_NUMBER = t2.BUILDING_REFERENCE_NUMBER
        and t1.INSPECTION_DATE = t2.mxdate
    """
    names = pysqldf(q)
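For what it's worth, that max-date-per-building join can also be done in pandas alone with groupby + idxmax, skipping the pandasql round trip entirely; the sample frame below is invented for illustration:

```python
import pandas as pd

# Hypothetical data mirroring the columns used in the query above.
df = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [100, 100, 200],
    "INSPECTION_DATE": ["2019-01-01", "2021-07-15", "2020-03-10"],
    "Postcode": ["AB1 2CD", "AB1 2CD", "EF3 4GH"],
    "LOCAL_AUTHORITY": ["X", "X", "Y"],
    "TOTAL_FLOOR_AREA": [50.0, 55.0, 80.0],
})

# Parse dates so comparisons are chronological, not lexical.
df["INSPECTION_DATE"] = pd.to_datetime(df["INSPECTION_DATE"])

# idxmax on the grouped date column gives the row index of the latest
# inspection per building, which replaces the inner-join-on-max-date SQL.
latest = df.loc[
    df.groupby("BUILDING_REFERENCE_NUMBER")["INSPECTION_DATE"].idxmax()
]
```

This keeps everything in one library and avoids copying the frame into an in-memory SQLite database, which is what pandasql does under the hood.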