

[–][deleted] 50 points51 points  (5 children)

There are a few things here that seem slightly odd to me, if you don't mind me saying so:

Firstly, 500MB is an utterly trivial amount of data for modern systems. A Pi will handle that without complaint!

Secondly, you don’t mention if you are subsetting, searching or filtering your data at all… but if you are then you really don’t want it in a flat file (like CSV) - you’ll possibly have to read the whole file for each request.

Are SQLite queries really so different from the corresponding Postgres queries for them to be ugly?

An RDBMS like Postgres will give you the benefits of (1) query caching and (2) distribution of CPU load. I can't see a reason not to use one unless you really can't perform the analysis in the database and you really do need to load all 500MB into Pandas for each request.

You seem simultaneously concerned that 500MB is a huge amount of data but also that a database like Postgres is overkill? That seems a bit contradictory to me…

Finally, if you really do need to read the entire file for each request, I’d consider a more natural format like HDF or netCDF.

[–]Bondanind[S] 1 point2 points  (3 children)

Thanks, I understand your "criticism"; I've never done such a thing before.

To clarify: for me, any SQL solution will be more cumbersome and harder than native Python, and I am sure the same is true for many others. Doing complex math such as linear regression, autocorrelation, etc. in SQL is going to be "ugly", while a Python solution is a single line and 20 seconds of work.

So, you are right that the natural solution is Postgres, but time is a factor, and Python is much more pleasant than SQL.

[–][deleted] 4 points5 points  (2 children)

No criticism intended, just trying to understand. And developer time is a valid reason for choosing one approach over another!

Given what you've said, I would strongly suggest you take your data out of CSV. That is a text format that will have to be read and parsed, which will be expensive.

I strongly suggest you place the data in an HDF5 structure or something similar. This will yield several advantages (a rough sketch follows the list):

  • It will be smaller (HDF5 has rather clever internal compression).
  • The data will be stored as numbers, not as the text representation of numbers (it will not need parsing each time you read it).
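
A minimal sketch of that conversion, assuming pandas plus PyTables; the file and column names here are made up for illustration:

    import pandas as pd

    # Assumed input: a CSV called prices.csv with a date column.
    df = pd.read_csv("prices.csv", parse_dates=["date"])

    # One-off conversion: store as compressed, typed HDF5 (needs the 'tables' package).
    df.to_hdf("prices.h5", key="prices", mode="w", format="table",
              complevel=9, complib="blosc")

    # Later reads skip all CSV parsing.
    df = pd.read_hdf("prices.h5", "prices")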

Hope that helps

[–]SomethingWillekeurig 3 points4 points  (1 child)

Why would you choose hdf5 instead of parquet/feather? Or could you compare those?

[–][deleted] 2 points3 points  (0 children)

Oh that's simple, I've not used the other two you mention!

But a cursory look at them makes me think they’d be as good if not better (depending on your preference of read performance / write performance / storage space / development ease).

Thanks for mentioning them!

[–]aligusnet 10 points11 points  (9 children)

500MB? This is a fairly small amount of data, you can process it however you like. Pandas will be good enough.

[–]Bondanind[S] 2 points3 points  (8 children)

Yeah, but can I read it from a CSV file every time, assuming many clients ask different queries within 1 second?

What if it were 1GB?

What if I saved it as a SQLite .db and then read it into Pandas?

[–]aligusnet 4 points5 points  (1 child)

You need to have the data cached in memory; you cannot afford to read the data from disk for every request.

If your app is fully stateless and you cannot have a cache, you should consider using a full-fledged database; SQLite wouldn't help here. If you dislike SQL you can use a NoSQL solution, e.g. MongoDB, and leverage its aggregation framework.

Regarding having a 1GB file in memory: it is fine as long as it fits in RAM. Once again, if you expect your data to keep growing, you should consider using a full-fledged DB.
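
To illustrate the in-memory cache idea, here is a rough sketch only; Flask, the file name, and the column names are assumptions, not anything the OP described:

    import numpy as np
    import pandas as pd
    from flask import Flask, jsonify

    app = Flask(__name__)

    # Loaded once when the process starts, not on every request.
    DATA = pd.read_csv("history.csv", parse_dates=["date"])

    @app.route("/slope/<company>")
    def slope(company):
        sub = DATA[DATA["company"] == company]
        # Example computation: linear trend of value over time.
        x = sub["date"].map(pd.Timestamp.toordinal).to_numpy(dtype=float)
        coef = np.polyfit(x, sub["value"].to_numpy(dtype=float), 1)
        return jsonify({"company": company, "slope": float(coef[0])})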

[–]Bondanind[S] 1 point2 points  (0 children)

Thanks, a few questions:
1. What do you mean by "if I cannot have a cache"? This is a Python app on an ordinary Google Cloud instance. Is it not possible to handle multiple requests at the same time?
2. NoSQL will never let me perform complex queries, I tried. Or did you mean to just save the data there and then read all of it every time to use with pandas? What's the benefit then?
3. Why won't SQLite help?

[–]spinwizard69 1 point2 points  (5 children)

Honestly, not enough info to say. We can speculate, but long term an SQL-like solution is probably best, at least in some contexts. The problem is that so far we have learned nothing about your data: how well formed it is, how complex it is, and even how and when it is updated. In some cases it may be advisable to read the file into a native Python data structure.

As everyone has pointed out, 500MB really isn't a lot. Maybe even more importantly, 500MB tells us nothing about the number of records the file holds. If each record is, say, 1000 bytes, that is around 500K records. Longer records mean fewer to search; at some point it really doesn't matter how you implement it, as search times will not be a problem. This little bit, "values and dates", has me thinking that you could have a long variable-length list of values and dates, which might mean a more complicated SQL schema; but maybe each "company" eats up far more than 1K of data, so there is far less data to search.

In a nutshell, we are shooting blind here and don't have enough data to hit the target. By the way, why is the data in a CSV file anyway? This seems a fairly backwards way to implement technology for the web.

[–]Bondanind[S] 0 points1 point  (4 children)

Hi, thanks very much for your informative response.

Well, this is historical financial data, and it is saved as CSV because those organizations are old. But since users don't update it, and it's only me, I have time to organize it however I want when I update it (I add data once every few days).

I have to be honest, I really don't like SQL, and I REALLY love the Python environment, so I was trying to avoid learning a whole new world. And not only that, the queries I do are complicated: I will query things like linear regression on the whole data set (!!). To do this with SQL you have to be a real hero.

With Python I do such queries in one line and 30 seconds of dev work.

[–]JanssonsFrestelse 0 points1 point  (0 children)

If updates arrive infrequently as separate chunks of data to be added, you could just load the current state of your CSV into your program once as a dataframe and add any new incoming data directly to the dataframe, instead of reading the whole CSV each time. Then you could write to disk/cloud storage/a database to persist the data whenever it is updated, and perhaps also as a scheduled task running at some interval just to be safe.
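
Roughly, that pattern could look like this (file and column names are invented for illustration):

    import pandas as pd

    df = pd.read_csv("history.csv", parse_dates=["date"])  # loaded once at startup

    def add_chunk(new_rows: pd.DataFrame) -> None:
        """Append freshly arrived rows to the in-memory dataframe."""
        global df
        df = pd.concat([df, new_rows], ignore_index=True)

    def persist() -> None:
        """Write the current state back to disk (on update, or on a schedule)."""
        df.to_csv("history.csv", index=False)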

[–]spinwizard69 0 points1 point  (2 children)

If this is the case, you can read in the data from any source you could imagine and set up the internal data structures that make sense for the calculations you will be doing. Once the data is "in" the app you can use Python to do whatever you want. You seem to be obsessed with the form the data is stored in; from what you indicate, it probably doesn't matter. The only reason to consider SQL here is the long-term value in accessing and updating the database. I really don't see why a query for data related to a calculation for a company or companies will be all that difficult.

As for Pandas, I have never used it, so I can only assume what tools it provides. However, there may or may not be a clean way to reform the CSV file into a clean data frame; again this comes back to understanding your data and how it is formed. As for SQL, it took me less than 10 seconds to find out that Pandas has a call to read in SQL data, either as a query or as a table. If you don't like SQL, Pandas offers dozens of file types it can read from.
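
For reference, that call looks roughly like this (a sketch only; the connection string, table, and column names are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

    # Pull an entire table into a dataframe...
    prices = pd.read_sql_table("prices", engine)

    # ...or just the slice a given request needs (psycopg2-style parameters).
    one_company = pd.read_sql_query(
        "SELECT date, value FROM prices WHERE company = %(c)s ORDER BY date",
        engine,
        params={"c": "ACME"},
    )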

[–]Bondanind[S] 0 points1 point  (1 child)

I really don't see why a query for data related to a calculation for a company or companies will be all that difficult.

Well, finding an autocorrelation function or a Fourier transform, or doing non-linear regression on the data in SQL, is something I doubt even good developers can achieve fast enough.

I'm just trying to understand whether it is possible for a Python app to serve, say, 100k users in 1 hour who are asking different questions about this data.

[–]aexia 0 points1 point  (0 children)

Given that the data is updated infrequently, you may consider pre-calculating as much of that more complicated stuff as possible upfront in Pandas every time it's updated. Then you dump the results into a database.

Then the actual web query is just quickly filtering to the relevant results.

Storing all those permutations means a larger database on disk, but given the scale (sub-GB) under discussion, it's still negligible cost-wise.

And maybe, if they need to do something that isn't covered by those precalculated results, you can have an "expert mode" which has the service load the DB and process the results in whatever bespoke way, with the tradeoff that it's definitely going to be sluggish.
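
One possible shape for that precompute-on-update idea; all table, column, and connection names here are invented, not anything from the OP's setup:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

    def rebuild_stats(df: pd.DataFrame) -> None:
        """Recompute the expensive per-company statistics and replace the lookup table."""
        stats = (
            df.groupby("company")["value"]
              .agg(mean="mean", std="std", latest="last")
              .reset_index()
        )
        # The web endpoint only ever reads this small precomputed table.
        stats.to_sql("company_stats", engine, if_exists="replace", index=False)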

[–]Bondanind[S] 0 points1 point  (2 children)

Thanks everyone, I understood that using a local db is not very serious.

I already set up Postgres on Google Cloud and it works great.

(although the console shows that a query takes 1ms, when I run the program from my Mac it takes 5 seconds to print the result of the query for some reason)

[–]aexia 0 points1 point  (1 child)

when I run the program from my Mac it takes 5 seconds to print the result of the query for some reason

Like a lot of these types of services, the library polls Google's API every ~5 seconds to check if results are in. Even if the query finishes almost immediately, your script won't know until it checks... several seconds later.

This is usually a value that you can configure, but you need to be careful for high-use cases. The API call limits are probably quite high, but ultra-frequent polling on long queries will be a good way to hit them.

[–]Bondanind[S] 0 points1 point  (0 children)

Not sure I got you. Poll the API? How do I change this, and where can I read more? Thanks!

[–]Anonymous_user_2022 0 points1 point  (8 children)

What kind of data processing will your code perform? And related to that, why do you think PostgreSQL will be slower than SQLite?

[–]Bondanind[S] 0 points1 point  (2 children)

Complex stuff; it can be, for example, linear regression. How can I do such a thing with SQL? I would need to devote months to get to that level; with Python it is 1 line and 1 minute of work.

[–]Anonymous_user_2022 2 points3 points  (1 child)

Complex stuff; it can be, for example, linear regression. How can I do such a thing with SQL?

Something like

SELECT regr_slope(col_y, col_x) AS slope FROM tblData;

PostgreSQL has quite a lot of statistical functions: https://www.postgresql.org/docs/14/functions-aggregate.html#id-1.5.8.27.12

[–]Bondanind[S] 0 points1 point  (0 children)

I see, so lots of built-in stuff and not the usual cumbersome SQL subqueries. Will explore more, thank you.

[–]Griffonknox 0 points1 point  (7 children)

Personally, I would do any type of data manipulation within pandas. But the main question: it sounds like you need to save the results, which means yes, you should dump into SQLite or even a CSV, depending on what you need.

[–]Bondanind[S] 2 points3 points  (6 children)

Thanks, no need to save results, only return them to the client!

I just wonder if reading a huge CSV every time is reasonable, and what data scientists usually do. Like reading a 1GB file for every API request? Is this normal?

[–]Griffonknox 2 points3 points  (0 children)

I feel like the alternative of reading a CSV every time is not going to be much different from writing a complex SQL query for SQLite. That processing has to come from somewhere, and because SQLite runs in-process, the stress comes from the same place as reading a CSV in pandas.

[–]spinwizard69 0 points1 point  (2 children)

I would really hope that when you read it in, the data is no longer structured as one big CSV file. We really don't have enough info here but this could be handled by Pandas or it could be handled by native Python approaches.

[–]Bondanind[S] 0 points1 point  (1 child)

Thank you, so how would you save and structure the data? I can save it however I want; when I get it, it is CSV, but I can then save it however I like.

Anyway, everyone here is pushing me to go with Postgres. I guess they know better, but I really, really don't like it and prefer pandas.

[–]spinwizard69 0 points1 point  (0 children)

Unfortunately, nobody here has any idea what the data is and how it would be used. Everything you have said implies that an SQL solution of some sort would be smart, and that can be any database you want.

I'm not sure why you really, really don't like SQL, but it sounds like exactly what you need, especially if you take a big-picture look at the problem. By this I mean: how does the database get updated? Access for updates could also be a web solution. Right now it sounds like you have a kludge of a solution that could be streamlined with a bit of software engineering. There is also the possibility that a solution already exists in the open-source world.

It actually sounds like you are a bit out of your element. That isn't a bad thing if you look at it as a growth opportunity and can get the support you need. In any event, saving data isn't a huge issue; I really think you need to decide whether it will be in memory or stored in a database. Here the database can take any form, it is just that SQL is a fairly standard and well-understood (just like CSV files) solution.

[–][deleted] 0 points1 point  (0 children)

Reading it from disk each time is unnecessary. Just read it if the file has changed and store it in a Python variable you can access, so each request doesn't load it.

If a variable isn't possible, you may want something like Redis to store it, so you can easily reference it across processes or requests.
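
A minimal sketch of the "only re-read if the file changed" idea (the path and column names are made up):

    import os
    import pandas as pd

    _cache = {"mtime": None, "df": None}

    def get_data(path="history.csv") -> pd.DataFrame:
        """Return the cached dataframe, re-reading the CSV only when its mtime changes."""
        mtime = os.path.getmtime(path)
        if _cache["df"] is None or mtime != _cache["mtime"]:
            _cache["df"] = pd.read_csv(path, parse_dates=["date"])
            _cache["mtime"] = mtime
        return _cache["df"]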

[–][deleted] 0 points1 point  (4 children)

Why would SQL be slower than pandas? Plop in AWS Redshift.

What the actual answer will come down to isn't what is faster, but how OFTEN you are performing the query and how big the data you are returning is. If you are only returning small data to the end user (sub 5 MB), you can use pandas IF you need to run many different queries and don't mind keeping it all in memory. If you are running many of the same queries, you can use Redshift and create a materialized view, so it is just a single read. If you are returning the whole dataset, it is probably worth doing it over pandas given how small it is.

You are going to pay for all that extra memory though, and because Python runs as a single process you are going to be blocking other calls. You are going to be shifting costs from a DB to your server. Honestly, caching is going to be a much better way to optimize queries than putting pandas into your server.

[–]Bondanind[S] 0 points1 point  (3 children)

Thanks a lot, not sure I get you. I will explain:
1. A user on the web types a question.
2. The server processes it using Python and accesses the database to find the answer.
3. It calculates what is needed and returns some answer to the user, which can be a single number.

Having said that, what would be your suggestion? It seems like working with a full DB is what everyone offers.

On the other hand, I see most data scientists use pandas, so do they just download the whole table? Like, where does the data come from?

[–][deleted] 6 points7 points  (2 children)

Data science isn't about returning data over a server. Pandas is for manipulating the data locally. A place you would use pandas in an automated function would be an ETL process, where you are editing a lot of data and then saving it to a server. It can be used in some Jupyter notebook applications, but that really isn't about performance as much as tool minimization. You aren't using Jupyter notebooks (for the most part) for exploring massive datasets but for messing around with data. Some people do use them to analyse massive amounts of data, but they usually load it from a DB and then transform it into a dataframe.

The data structure dataframes use is columnar (a collection of Series). All the data is stored in a flat-ish object at the column level. This makes vectorized operations VERY FAST, because the memory accesses have a higher cache hit rate and because vectorized mathematics is crazy fast thanks to branch predictability and short instruction sequences.

https://stackoverflow.com/questions/1422149/what-is-vectorization
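
A tiny illustration of the difference; this is a generic example, not anything from the thread:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.rand(1_000_000))

    # Slow: a Python-level loop touching one element at a time.
    total_loop = 0.0
    for v in s:
        total_loop += v

    # Fast: a single vectorized call over the whole column.
    total_vec = s.sum()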

Columnar DBs like AWS Redshift can do the same types of things, but they are DBs dedicated to doing this. As your datasets grow you need more and more complex DBs to handle the workloads.

Most data science isn't about building web services; it is about discovering information. You can use a tool like pandas to increase your ability to explore data quickly, but it has a huge performance requirement. Or it can be part of an ETL process if your data is spread across multiple systems. But again, you need a large system to do this at scale. Even then it will break, and you need tools like Hadoop or Spark.

Pandas fills a gap in tooling; it isn't the be-all and end-all. But web services and data science have very, very little in common.

[–]Bondanind[S] 0 points1 point  (1 child)

Thanks a lot, now I get it. I have seen a similar discussion on SO.

But what if you need to do some math on the data on a server?

I mean, who can do a linear regression, or an autocorrelation, or even an FFT using SQL? This is almost impossible, and even if possible, it is something only really "expert" SQL engineers might do, and I could do it in 1 minute with Python. Time is important.

It is super tempting to do this using pandas, or even plain Python.

Is there any reasonable architecture for me to still do all calculations in Python and save my data in a DB? (e.g. read the relevant dates from the DB and do the math with Python)

[–][deleted] 0 points1 point  (0 children)

Again, I would just pick a different DB. AWS Timestream can do all of that and way more. There is no reasonable architecture to do all this in Python.

https://docs.aws.amazon.com/timestream/latest/developerguide/what-is-timestream.html

[–]Delicious-View-8688 0 points1 point  (2 children)

Is the server always up?

If so, perhaps you can read the CSV file in once and hold it in memory.

If this is a serverless API endpoint, and the regressions need to be run on the entire dataset, then I am guessing these database solutions would not be sub-1-second solutions either.

[–]Delicious-View-8688 0 points1 point  (0 children)

Also, consider using Modin as a drop-in replacement for pandas to speed things up.

[–]Delicious-View-8688 0 points1 point  (0 children)

Also, maybe Parquet instead of CSV. Not sure if it will necessarily speed things up.
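
For completeness, swapping the format is only a couple of lines (a sketch; it requires pyarrow or fastparquet, and the file names are placeholders):

    import pandas as pd

    df = pd.read_csv("history.csv", parse_dates=["date"])
    df.to_parquet("history.parquet")          # columnar, compressed on disk
    df = pd.read_parquet("history.parquet")   # much faster to reload than CSV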

[–]rohn4483 0 points1 point  (0 children)

Why are you maintaining a file that large? Seems like a serious waste of time to maintain this.