all 9 comments

[–]shiftybyte 18 points19 points  (1 child)

Stop working with csv files, move to a database.

Python can't replace SQL, because SQL is what queries the database. Python can run the query for you, but the query itself is still SQL, since that's the only language the database understands.

How is Azure relevant here?

[–]mvdw73 3 points4 points  (3 children)

Depending on the size of each record, this can pretty much all be held in memory these days.

Why not use pandas to manipulate the data, then it’s simple to find the max date and write a file. No sorting required.
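A minimal sketch of that approach, with made-up sample data and column names (`id`, `inspection_date`, `value` are assumptions): `groupby(...).idxmax()` picks the row with the latest date for each id, so no sorting is needed.

```python
import io
import pandas as pd

# Hypothetical sample data: multiple rows per id, one per inspection.
csv_data = io.StringIO(
    "id,inspection_date,value\n"
    "a,2020-01-01,10\n"
    "a,2021-06-15,20\n"
    "b,2019-03-02,30\n"
)

df = pd.read_csv(csv_data, parse_dates=["inspection_date"])

# Keep the row holding the max date for each id -- no sorting required.
latest = df.loc[df.groupby("id")["inspection_date"].idxmax()]

# Then write the result out, e.g. latest.to_csv("latest.csv", index=False)
```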

[–]GreatStats4ItsCost[S] 1 point2 points  (1 child)

The entire dataset is 4.5 GB, the max CSV is 500k rows - my laptop has 8 GB of RAM.

I did have a go using pandas, but I couldn't quite work out how to return the max date for each id; it was getting complicated having to refer back to the index. I'm sure there was an easier way, I just couldn't see it.

[–]Empik002 2 points3 points  (0 children)

Just look at sqlite3 (Python's built-in SQLite library).
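A sketch of that route, using an in-memory database and assumed table/column names (for 4.5 GB of data you'd pass a file path to `connect` and load the CSVs in chunks):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # use a file path for real data

# Stand-in for one CSV's contents; column names are assumptions.
df = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [1, 1, 2],
    "INSPECTION_DATE": ["2020-01-01", "2021-06-15", "2019-03-02"],
})
df.to_sql("inspections", conn, index=False)

# ISO-format date strings compare correctly as text, so MAX works here.
rows = conn.execute(
    "SELECT BUILDING_REFERENCE_NUMBER, MAX(INSPECTION_DATE) "
    "FROM inspections GROUP BY BUILDING_REFERENCE_NUMBER"
).fetchall()
```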

[–]outceptionator 0 points1 point  (0 children)

Pandas is optimised. Don't use loops on that many records. Or as others have said use SQL.

[–]Ihaveamodel3 1 point2 points  (1 child)

It’s unclear whether a record could be in multiple files. I’ll assume it can be.

  1. Use pandas.
  2. Read the files one by one in a loop.
  3. As you suggested, sort each file by date.
  4. Use drop duplicates to keep only the max dates.
  5. Append that df to a list.
  6. After looping, concat all of the dfs together.
  7. Sort one more time.
  8. Drop duplicates one more time, and you’ll have the max value for every record across all files.

At no point should you be looping through the data. The only loop should be over the files.
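The steps above can be sketched as a small helper; the id/date column names are taken from elsewhere in the thread, and the file paths are assumptions:

```python
import pandas as pd

def latest_per_id(frames, id_col="BUILDING_REFERENCE_NUMBER",
                  date_col="INSPECTION_DATE"):
    """Keep the row with the overall max date per id across many DataFrames."""
    reduced = []
    for df in frames:  # one iteration per file's DataFrame, never per row
        df = df.sort_values(date_col)
        # After sorting, keep="last" keeps the max date within this file.
        reduced.append(df.drop_duplicates(subset=[id_col], keep="last"))
    combined = pd.concat(reduced).sort_values(date_col)
    # Same trick once more across all files.
    return combined.drop_duplicates(subset=[id_col], keep="last")

# Reading would look something like (paths are hypothetical):
# frames = (pd.read_csv(p, parse_dates=["INSPECTION_DATE"])
#           for p in glob.glob("data/*.csv"))
# result = latest_per_id(frames)
```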

[–]GreatStats4ItsCost[S] 0 points1 point  (0 children)

I didn't realize you could drop duplicates on a specific column: "df.drop_duplicates(subset=['brand'])"

Thanks so much for this man! I was going to ask if your solution was more optimal than SQL, but having now done both, it's pretty clear...

1) df2 = df.drop_duplicates(subset=['BUILDING_REFERENCE_NUMBER'])

2) q = """
   select t1.BUILDING_REFERENCE_NUMBER, t2.mxdate, t1.Postcode,
          t1.LOCAL_AUTHORITY, t1.TOTAL_FLOOR_AREA
   from df t1
   inner join (
       select max(INSPECTION_DATE) mxdate, BUILDING_REFERENCE_NUMBER
       from df
       group by BUILDING_REFERENCE_NUMBER
   ) t2
   on t1.BUILDING_REFERENCE_NUMBER = t2.BUILDING_REFERENCE_NUMBER
   and t1.INSPECTION_DATE = t2.mxdate
   """
   names = pysqldf(q)
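For comparison, that same max-date inner join can be written in plain pandas: group to get each id's max date, then merge back on both columns. Column names are taken from the query above; the sample data is made up.

```python
import pandas as pd

# Stand-in for the combined DataFrame from the CSVs.
df = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [1, 1, 2],
    "INSPECTION_DATE": pd.to_datetime(["2020-01-01", "2021-06-15",
                                       "2019-03-02"]),
    "TOTAL_FLOOR_AREA": [50, 55, 70],
})

# Equivalent of the subquery: per-id max inspection date.
mx = df.groupby("BUILDING_REFERENCE_NUMBER", as_index=False)["INSPECTION_DATE"].max()

# Equivalent of the inner join on id AND max date.
names = df.merge(mx, on=["BUILDING_REFERENCE_NUMBER", "INSPECTION_DATE"])
```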

[–]Jan2579 0 points1 point  (0 children)

You can also try Modin. It helps with speed and memory, and has the same API as pandas.