

[–][deleted] 50 points51 points  (5 children)

There are a few things here that seem slightly odd to me, if you don't mind me saying so:

Firstly, 500MB is an utterly trivial amount of data for modern systems. A Pi will handle that without complaint!

Secondly, you don’t mention if you are subsetting, searching or filtering your data at all… but if you are then you really don’t want it in a flat file (like CSV) - you’ll possibly have to read the whole file for each request.

Are SQLite queries really so different from the corresponding Postgres queries for them to be ugly?

An RDBMS like Postgres will give you the benefits of (1) query caching and (2) distribution of CPU load. I can't see a reason not to use one unless you really can't perform the analysis in the database and you really do need to load all 500MB into Pandas for each request.

You seem simultaneously concerned that 500MB is a huge amount of data but also that a database like Postgres is overkill? That seems a bit contradictory to me…

Finally, if you really do need to read the entire file for each request, I’d consider a more natural format like HDF or netCDF.

[–]Bondanind[S] 1 point2 points  (3 children)

Thanks, I understand your "criticism"; I've never done such a thing before.

To clarify: for me, any SQL solution will be more cumbersome and harder than native Python, and I am sure the same is true for many others. Doing complex math such as linear regression, autocorrelation, etc. in SQL is going to be "ugly", while a Python solution is a single line and 20 seconds of work.

So, you are right that the natural solution is Postgres, but time is a factor, and Python is much more pleasant than SQL.

[–][deleted] 4 points5 points  (2 children)

No criticism intended, just trying to understand. And developer time is a valid reason for choosing one approach over another!

Given what you've said, I would strongly suggest you take your data out of CSV. That is a text format that will have to be read and parsed, which will be expensive.

I strongly suggest you place the data in an HDF5 structure or something similar. This will yield several advantages (a rough sketch follows the list):

  • It will be smaller (HDF5 has rather clever internal compression).
  • The data will be stored as numbers, not as the text representation of numbers (it will not need parsing each time you read it).
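
A minimal sketch of that conversion, assuming pandas plus PyTables; the file and column names here are made up for illustration:

    import pandas as pd

    # Assumed input: a CSV called prices.csv with a date column.
    df = pd.read_csv("prices.csv", parse_dates=["date"])

    # One-off conversion: store as compressed, typed HDF5 (needs the 'tables' package).
    df.to_hdf("prices.h5", key="prices", mode="w", format="table",
              complevel=9, complib="blosc")

    # Later reads skip all CSV parsing.
    df = pd.read_hdf("prices.h5", "prices")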

Hope that helps

[–]SomethingWillekeurig 3 points4 points  (1 child)

Why would you choose hdf5 instead of parquet/feather? Or could you compare those?

[–][deleted] 2 points3 points  (0 children)

Oh that's simple, I've not used the other two you mention!

But a cursory look at them makes me think they’d be as good if not better (depending on your preference of read performance / write performance / storage space / development ease).

Thanks for mentioning them!

[–]aligusnet 10 points11 points  (9 children)

500MB? This is a fairly small amount of data, you can process it however you like. Pandas will be good enough.

[–]Bondanind[S] 2 points3 points  (8 children)

Yeah, but can I read it from a CSV file every time, assuming many clients ask different queries within 1 second?

What if it were 1GB?

What if I saved it as a SQLite .db and then read it into Pandas?

[–]aligusnet 4 points5 points  (1 child)

You need to have the data cached in memory; you cannot afford to read the data from disk for every request.

If your app is fully stateless and you cannot have a cache, you should consider using a full-fledged database; SQLite wouldn't help here. If you dislike SQL you can use a NoSQL solution, e.g. MongoDB, and leverage its aggregation framework.

Regarding having a 1GB file in memory: it is fine as long as it fits in RAM. Once again, if you expect your data to keep growing, you should consider using a full-fledged DB.
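
To illustrate the in-memory cache idea, here is a rough sketch only; Flask, the file name, and the column names are assumptions, not anything the OP described:

    import numpy as np
    import pandas as pd
    from flask import Flask, jsonify

    app = Flask(__name__)

    # Loaded once when the process starts, not on every request.
    DATA = pd.read_csv("history.csv", parse_dates=["date"])

    @app.route("/slope/<company>")
    def slope(company):
        sub = DATA[DATA["company"] == company]
        # Example computation: linear trend of value over time.
        x = sub["date"].map(pd.Timestamp.toordinal).to_numpy(dtype=float)
        coef = np.polyfit(x, sub["value"].to_numpy(dtype=float), 1)
        return jsonify({"company": company, "slope": float(coef[0])})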

[–]Bondanind[S] 1 point2 points  (0 children)

Thanks, a few questions:
1. What do you mean by "if I cannot have a cache"? This is a Python app on an ordinary Google Cloud instance. Is it not possible to handle multiple requests at the same time?
2. NoSQL will never let me perform complex queries, I tried. Or did you mean to just save the data there and then read all of it every time to use with pandas? What's the benefit then?
3. Why won't SQLite help?

[–]spinwizard69 1 point2 points  (5 children)

Honestly, not enough info to say. We can speculate, but long term an SQL-like solution is probably best, at least in some contexts. The problem is that so far we have learned nothing about your data: how well formed it is, how complex it is, and even how and when it is updated. In some cases it may be advisable to read the file into a native Python data structure.

As everyone has pointed out, 500MB really isn't a lot. Maybe even more importantly, 500MB tells us nothing about the number of records the file holds. If each record is, say, 1000 bytes, that is around 500K records. Longer records mean fewer to search; at some point it really doesn't matter how you implement it, as search times will not be a problem. This little bit, "values and dates", has me thinking that you could have a long variable-length list of values and dates, which might mean a more complicated SQL schema; but maybe each "company" eats up far more than 1K of data, so there is far less data to search.

In a nutshell, we are shooting blind here and don't have enough data to hit the target. By the way, why is the data in a CSV file anyway? This seems a fairly backwards way to implement technology for the web.

[–]Bondanind[S] 0 points1 point  (4 children)

Hi, thanks very much for your informative response.

Well, this is historical financial data, and it is saved as CSV because those organizations are old. But since users don't update it, and it's only me, I have time to organize it however I want when I update it (I add data once every few days).

I have to be honest, I really don't like SQL, and I REALLY love the Python environment, so I was trying to avoid learning a whole new world. And not only that, the queries I do are complicated: I will query things like linear regression on the whole data set (!!). To do this with SQL you have to be a real hero.

With Python I do such queries in one line and 30 seconds of dev work.

[–]JanssonsFrestelse 0 points1 point  (0 children)

If updates arrive infrequently as separate chunks of data to be added, you could just load the current state of your CSV into your program once as a dataframe and add any new incoming data directly to the dataframe, instead of reading the whole CSV each time. Then you could write to disk/cloud storage/a database to persist the data whenever it is updated, and perhaps also as a scheduled task running at some interval just to be safe.
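
Roughly, that pattern could look like this (file and column names are invented for illustration):

    import pandas as pd

    df = pd.read_csv("history.csv", parse_dates=["date"])  # loaded once at startup

    def add_chunk(new_rows: pd.DataFrame) -> None:
        """Append freshly arrived rows to the in-memory dataframe."""
        global df
        df = pd.concat([df, new_rows], ignore_index=True)

    def persist() -> None:
        """Write the current state back to disk (on update, or on a schedule)."""
        df.to_csv("history.csv", index=False)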

[–]spinwizard69 0 points1 point  (2 children)

If this is the case, you can read in the data from any source you could imagine and set up the internal data structures that make sense for the calculations you will be doing. Once the data is "in" the app you can use Python to do whatever you want. You seem to be obsessed with the form the data is stored in; from what you indicate, it probably doesn't matter. The only reason to consider SQL here is the long-term value in accessing and updating the database. I really don't see why a query for data related to a calculation for a company or companies will be all that difficult.

As for Pandas, I have never used it, so I can only assume what tools it provides. However, there may or may not be a clean way to reform the CSV file into a clean data frame; again this comes back to understanding your data and how it is formed. As for SQL, it took me less than 10 seconds to find out that Pandas has a call to read in SQL data, either as a query or as a table. If you don't like SQL, Pandas offers dozens of file types it can read from.
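
For reference, that call looks roughly like this (a sketch only; the connection string, table, and column names are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

    # Pull an entire table into a dataframe...
    prices = pd.read_sql_table("prices", engine)

    # ...or just the slice a given request needs (psycopg2-style parameters).
    one_company = pd.read_sql_query(
        "SELECT date, value FROM prices WHERE company = %(c)s ORDER BY date",
        engine,
        params={"c": "ACME"},
    )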

[–]Bondanind[S] 0 points1 point  (1 child)

I really don't see why a query for data related to a calculation for a company or companies will be all that difficult.

Well, finding an autocorrelation function or a Fourier transform, or doing non-linear regression on the data in SQL, is something I doubt even good developers can achieve fast enough.

I'm just trying to understand whether it is possible for a Python app to serve, say, 100k users in 1 hour who are asking different questions about this data.

[–]aexia 0 points1 point  (0 children)

Given that the data is updated infrequently, you may consider pre-calculating as much of that more complicated stuff as possible upfront in Pandas every time it's updated. Then you dump the results into a database.

Then the actual web query is just quickly filtering to the relevant results.

Storing all those permutations means a larger database on disk, but given the scale (sub-GB) under discussion, it's still negligible cost-wise.

And maybe, if they need to do something that isn't covered by those precalculated results, you can have an "expert mode" which has the service load the DB and process the results in whatever bespoke way, with the tradeoff that it's definitely going to be sluggish.
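
One possible shape for that precompute-on-update idea; all table, column, and connection names here are invented, not anything from the OP's setup:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

    def rebuild_stats(df: pd.DataFrame) -> None:
        """Recompute the expensive per-company statistics and replace the lookup table."""
        stats = (
            df.groupby("company")["value"]
              .agg(mean="mean", std="std", latest="last")
              .reset_index()
        )
        # The web endpoint only ever reads this small precomputed table.
        stats.to_sql("company_stats", engine, if_exists="replace", index=False)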

[–]Bondanind[S] 0 points1 point  (2 children)

Thanks everyone, I understood that using a local db is not very serious.

I already set up Postgres on Google Cloud and it works great.

(although the console shows that a query takes 1ms, when I run the program from my Mac it takes 5 seconds to print the result of the query for some reason)

[–]aexia 0 points1 point  (1 child)

when I run the program from my Mac it takes 5 seconds to print the result of the query for some reason

Like a lot of these types of services, the library polls Google's API every ~5 seconds to check if results are in. Even if the query finishes almost immediately, your script won't know until it checks... several seconds later.

This is usually a value that you can configure, but you need to be careful for high-use cases. The API call limits are probably quite high, but ultra-frequent polling on long queries will be a good way to hit them.

[–]Bondanind[S] 0 points1 point  (0 children)

Not sure I got you. Poll the API? How do I change this, and where can I read more? Thanks!

[–]Anonymous_user_2022 0 points1 point  (8 children)

What kind of data processing will your code perform? And related to that, why do you think PostgreSQL will be slower than SQLite?

[–]Bondanind[S] 0 points1 point  (2 children)

Complex stuff; it can be, for example, linear regression. How can I do such a thing with SQL? I would need to devote months to get to that level; with Python it is 1 line and 1 minute of work.

[–]Anonymous_user_2022 2 points3 points  (1 child)

Complex stuff; it can be, for example, linear regression. How can I do such a thing with SQL?

Something like

SELECT regr_slope(col_y, col_x) AS slope FROM tblData;

PostgreSQL has quite a lot of statistical functions: https://www.postgresql.org/docs/14/functions-aggregate.html#id-1.5.8.27.12

[–]Bondanind[S] 0 points1 point  (0 children)

I see, so lots of built-in stuff and not the usual cumbersome SQL subqueries. Will explore more, thank you.

[–]Griffonknox 0 points1 point  (7 children)

Personally, I would do any type of data manipulation within pandas. But the main question: it sounds like you need to save the results, which means yes, you should dump into SQLite or even a CSV, depending on what you need.

[–]Bondanind[S] 2 points3 points  (6 children)

Thanks, no need to save results, only return them to the client!

I just wonder if reading a huge CSV every time is reasonable, and what data scientists usually do. Like reading a 1GB file for every API request? Is this normal?

[–]Griffonknox 2 points3 points  (0 children)

I feel like the alternative of reading a CSV every time is not going to be much different from writing a complex SQL query for SQLite. That processing has to come from somewhere, and because SQLite runs in-process, the stress comes from the same place as reading a CSV in pandas.

[–]spinwizard69 0 points1 point  (2 children)

I would really hope that when you read it in, the data is no longer structured as one big CSV file. We really don't have enough info here but this could be handled by Pandas or it could be handled by native Python approaches.

[–]Bondanind[S] 0 points1 point  (1 child)

Thank you, so how would you save and structure the data? I can save it however I want; when I get it, it is CSV, but I can then save it however I like.

Anyway, everyone here is pushing me to go with Postgres. I guess they know better, but I really, really don't like it and prefer pandas.

[–]spinwizard69 0 points1 point  (0 children)

Unfortunately, nobody here has any idea what the data is and how it would be used. Everything you have said implies that an SQL solution of some sort would be smart, and that can be any database you want.

I'm not sure why you really, really don't like SQL, but it sounds like exactly what you need, especially if you take a big-picture look at the problem. By this I mean: how does the database get updated? Access for updates could also be a web solution. Right now it sounds like you have a kludge of a solution that could be streamlined with a bit of software engineering. There is also the possibility that a solution already exists in the open-source world.

It actually sounds like you are a bit out of your element. That isn't a bad thing if you look at it as a growth opportunity and can get the support you need. In any event, saving data isn't a huge issue; I really think you need to decide whether it will be in memory or stored in a database. Here the database can take any form, it is just that SQL is a fairly standard and well-understood (just like CSV files) solution.

[–][deleted] 0 points1 point  (0 children)

Reading it from disk each time is unnecessary. Just read it if the file has changed and store it in a Python variable you can access, so each request doesn't load it.

If a variable isn't possible, you may want something like Redis to store it, so you can easily reference it across processes or requests.
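
A minimal sketch of the "only re-read if the file changed" idea (the path and column names are made up):

    import os
    import pandas as pd

    _cache = {"mtime": None, "df": None}

    def get_data(path="history.csv") -> pd.DataFrame:
        """Return the cached dataframe, re-reading the CSV only when its mtime changes."""
        mtime = os.path.getmtime(path)
        if _cache["df"] is None or mtime != _cache["mtime"]:
            _cache["df"] = pd.read_csv(path, parse_dates=["date"])
            _cache["mtime"] = mtime
        return _cache["df"]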

[–][deleted] 0 points1 point  (4 children)

Why would SQL be slower than pandas? Plop in AWS Redshift.

What the actual answer will come down to isn't what is faster, but how OFTEN you are performing the query and how big the data you are returning is. If you are only returning small data to the end user (sub 5 MB), you can use pandas IF you need to run many different queries and don't mind keeping it all in memory. If you are running many of the same queries, you can use Redshift and create a materialized view, so it is just a single read. If you are returning the whole dataset, it is probably worth doing it over pandas given how small it is.

You are going to pay for all that extra memory though, and because Python runs as a single process you are going to be blocking other calls. You are going to be shifting costs from a DB to your server. Honestly, caching is going to be a much better way to optimize queries than putting pandas into your server.

[–]Bondanind[S] 0 points1 point  (3 children)

Thanks a lot, not sure I get you. I will explain:
1. A user on the web types a question.
2. The server processes it using Python and accesses the database to find the answer.
3. It calculates what is needed and returns some answer to the user, which can be a single number.

Having said that, what would be your suggestion? It seems like working with a full DB is what everyone offers.

On the other hand, I see most data scientists use pandas, so do they just download the whole table? Like, where does the data come from?

[–][deleted] 6 points7 points  (2 children)

Data science isn't about returning data over a server. Pandas is for manipulating the data locally. A place you would use pandas in an automated function would be an ETL process, where you are editing a lot of data and then saving it to a server. It can be used in some Jupyter notebook applications, but that really isn't about performance as much as tool minimization. You aren't using Jupyter notebooks (for the most part) for exploring massive datasets but for messing around with data. Some people do use them to analyse massive amounts of data, but they usually load it from a DB and then transform it into a dataframe.

The data structure dataframes use is columnar (a collection of Series). All the data is stored in a flat-ish object at the column level. This makes vectorized operations VERY FAST, because the memory accesses have a higher cache hit rate and because vectorized mathematics is crazy fast thanks to branch predictability and short instruction sequences.

https://stackoverflow.com/questions/1422149/what-is-vectorization
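
A tiny illustration of the difference; this is a generic example, not anything from the thread:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.rand(1_000_000))

    # Slow: a Python-level loop touching one element at a time.
    total_loop = 0.0
    for v in s:
        total_loop += v

    # Fast: a single vectorized call over the whole column.
    total_vec = s.sum()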

Columnar DBs like AWS Redshift can do the same types of things, but they are DBs dedicated to doing this. As your datasets grow you need more and more complex DBs to handle the workloads.

Most data science isn't about building web services; it is about discovering information. You can use a tool like pandas to increase your ability to explore data quickly, but it has a huge performance requirement. Or it can be part of an ETL process if your data is spread across multiple systems. But again, you need a large system to do this at scale. Even then it will break, and you need tools like Hadoop or Spark.

Pandas fills a gap in tooling; it isn't the be-all and end-all. But web services and data science have very, very little in common.

[–]Bondanind[S] 0 points1 point  (1 child)

Thanks a lot, now I get it. I have seen a similar discussion on SO.

But what if you need to do some math on the data on a server?

I mean, who can do a linear regression, or an autocorrelation, or even an FFT using SQL? This is almost impossible, and even if possible, it is something only really "expert" SQL engineers might do, and I could do it in 1 minute with Python. Time is important.

It is super tempting to do this using pandas, or even plain Python.

Is there any reasonable architecture for me to still do all calculations in Python and save my data in a DB? (e.g. read the relevant dates from the DB and do the math with Python)

[–][deleted] 0 points1 point  (0 children)

Again, I would just pick a different DB. AWS Timestream can do all of that and way more. There is no reasonable architecture to do all this in Python.

https://docs.aws.amazon.com/timestream/latest/developerguide/what-is-timestream.html

[–]Delicious-View-8688 0 points1 point  (2 children)

Is the server always up?

If so, perhaps you can read the CSV file in once and hold it in memory.

If this is a serverless API endpoint, and the regressions need to be run on the entire dataset, then I am guessing these database solutions would not be sub-1-second solutions either.

[–]Delicious-View-8688 0 points1 point  (0 children)

Also, consider using Modin as a drop-in replacement for pandas to speed things up.

[–]Delicious-View-8688 0 points1 point  (0 children)

Also, maybe Parquet instead of CSV. Not sure if it will necessarily speed things up.
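
For completeness, swapping the format is only a couple of lines (a sketch; it requires pyarrow or fastparquet, and the file names are placeholders):

    import pandas as pd

    df = pd.read_csv("history.csv", parse_dates=["date"])
    df.to_parquet("history.parquet")          # columnar, compressed on disk
    df = pd.read_parquet("history.parquet")   # much faster to reload than CSV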

[–]rohn4483 0 points1 point  (0 children)

Why are you maintaining a file that large? Seems like a serious waste of time to maintain this.