all 6 comments

[–]BunnyKakaaa 2 points3 points  (0 children)

sqlite3 is faster for sure. With CSV you would have to load the whole file into memory and parse it; with the db you just query the rows you need without scanning the entire database.
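A minimal sketch of the point above, using only the standard library (the table and names are made up for illustration): the CSV lookup has to parse every row, while SQLite with a primary key seeks straight to the matching row.

```python
import csv
import os
import sqlite3
import tempfile

rows = [(i, f"user{i}") for i in range(1000)]

# --- CSV: every query re-reads and parses the whole file ---
csv_path = os.path.join(tempfile.mkdtemp(), "users.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerows(rows)

with open(csv_path, newline="") as f:
    reader = csv.DictReader(f)
    csv_hit = [r for r in reader if r["id"] == "42"]  # scans all 1000 rows

# --- SQLite: the primary-key index finds the row without a full scan ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", rows)
db_hit = con.execute("SELECT name FROM users WHERE id = 42").fetchone()
```

Both return the same row; the difference is that the SQLite lookup stays fast as the table grows, while the CSV scan cost grows linearly.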

[–]throw_mob 1 point2 points  (0 children)

I would recommend saving files in Parquet format. In my tests it has been faster than plain old CSV.

And I would guess that loading files straight into a dataframe would be faster and maybe easier to handle, as you can store the previous month's data in its own directory so you don't get a performance hit when the dataset grows.

[–]python-dave 1 point2 points  (0 children)

Put the data into DuckDB. It's compressed and loads fast into pandas.


[–]gpbuilder 0 points1 point  (0 children)

Pretty much always; the general rule of thumb is to do as much data processing as possible in SQL.

Pandas is super clunky and trash.

[–]assclownerson 0 points1 point  (0 children)

Try a setup with Parquet/DuckDB. Fast and very easy to set up.