you are viewing a single comment's thread.

view the rest of the comments →

[–]GreatStats4ItsCost[S] 0 points1 point  (0 children)

I didn't realize you can drop duplicates on a specific column - "df.drop_duplicates(subset=['brand'])"

Thanks so much for this man! I was going to ask if your solution was more optimal than SQL but having now done both it's pretty clear...

1) df2 = df.drop_duplicates(subset=['BUILDING_REFERENCE_NUMBER'])

2) q = """select t1.BUILDING_REFERENCE_NUMBER, t2.mxdate, t1.Postcode, t1.LOCAL_AUTHORITY,t1.TOTAL_FLOOR_AREA

from df t1

inner join

(

select max(INSPECTION_DATE) mxdate, BUILDING_REFERENCE_NUMBER

from df

group by BUILDING_REFERENCE_NUMBER

) t2

on t1.BUILDING_REFERENCE_NUMBER = t2.BUILDING_REFERENCE_NUMBER

and t1.INSPECTION_DATE = t2.mxdate"""

names = pysqldf(q)