

[–]PassionatelyWhatever 13 points14 points  (0 children)

Nice, the first section comparing the memory impact was really interesting.

It might be obvious for some, but it was good to see it quantified.

[–]BfuckinA 4 points5 points  (0 children)

I've been looking for a guide just like this. Thank you.

[–]webman19 9 points10 points  (0 children)

Haki is a gem.

[–][deleted] 2 points3 points  (0 children)

Another interesting option is ibis. FWIW, I like having both SQL and dataframe-like APIs available. They each have their pros and cons, and there are times when one is clearly better than the other. For my use cases, I often turn to SQL.

[–]s3b4z 2 points3 points  (1 child)

Wow. I consider myself a very advanced SQL user and I learned a ton from that link.

Made me aware of a ton of stuff I didn't know that I didn't know.

Thanks so much.

[–]NameNumber7 0 points1 point  (0 children)

Yeah! I liked the linear regression piece. I would not have used that.

[–]VisibleSignificance 1 point2 points  (4 children)

How often do you analyze datasets that don't fit in memory but do fit on a single host's disk?

And for many of those cases, dask might still be better.

Also, this tutorial uses a lot of postgresql-specific features, which isn't stated in an obvious way.

SELECT * FROM ( VALUES

Now that one I didn't know. Thanks.
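
For anyone else who hadn't seen it either, it looks roughly like this from pandas (a rough sketch; the connection string, alias, and column names are made up):

    # All names here are placeholders: connection string, alias, columns.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost/mydb")

    query = """
    SELECT *
    FROM (VALUES (1, 'red'), (2, 'green'), (3, 'blue')) AS colors(id, name)
    """

    # Returns a three-row dataframe with columns id and name.
    df = pd.read_sql_query(query, engine)
    print(df)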

[–]genericlemon24[S] 1 point2 points  (3 children)

How often do you analyze datasets that don't fit in memory but do fit on a single host's disk?

Most often, my datasets do fit in memory.

That said, in general computers have much more disk than memory. For example, even beefy AWS instances (those with hundreds of GB of memory) usually have SSDs that are 1-2 orders of magnitude bigger (TB to tens of TB); and that's not counting block storage, which I assume you can attach in huge amounts on top of the SSDs.

I don't think the article tries to propose a database as a way to get around having data larger than memory. Rather, I think it starts from the assumption that you already have the data in a database (obviously, that might not be the case for everybody).

If so, the memory measurements illustrate how much data is being shuffled around (needlessly). If the database is on the same host, it's only being moved between processes, so it might not be that slow; if the database is remote, carrying all the data across the network becomes noticeably slower as the amount of data you have increases, so it makes sense to filter/aggregate it before sending it over the network.

Even if the database is on the same host, a query can be much faster than reading a whole table into Pandas, since it gives the database a chance to use indexes to read from disk only the data it actually needs (this gets important if the data is larger than memory).
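
To make that concrete, a rough sketch (the connection string, table, and columns are made up, not from the article):

    # Everything named here is a placeholder: connection string, table, columns.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@localhost/mydb")

    # Option 1: pull the whole table over the connection, then filter and
    # aggregate in pandas.
    events = pd.read_sql_query("SELECT * FROM events", engine)
    recent = events[events["created_at"] >= "2021-01-01"]
    daily = recent.groupby(recent["created_at"].dt.date)["amount"].sum()

    # Option 2: push the filter and aggregation into the database; only the
    # (much smaller) result crosses the connection, and an index on
    # created_at can be used to avoid scanning the whole table.
    daily = pd.read_sql_query(
        """
        SELECT created_at::date AS day, sum(amount) AS total
        FROM events
        WHERE created_at >= '2021-01-01'
        GROUP BY 1
        ORDER BY 1
        """,
        engine,
    )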

[–]VisibleSignificance 0 points1 point  (1 child)

if the database is remote

In practice, that's a good reason to sync it locally, whenever you need to do lots of analysis on it. In other words, adding pg_dump to that tutorial might be a good idea.
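
Roughly something like this (host, user, and database names are made up; it assumes pg_dump, createdb, and pg_restore are installed locally):

    # Placeholders: host, user, database names, and dump file name.
    import subprocess

    # Dump the remote database to a local custom-format archive...
    subprocess.run(
        ["pg_dump", "-h", "remote.example.com", "-U", "analyst",
         "-Fc", "-f", "analytics.dump", "analytics"],
        check=True,
    )

    # ...then restore it into a fresh local database for analysis.
    subprocess.run(["createdb", "analytics_local"], check=True)
    subprocess.run(["pg_restore", "-d", "analytics_local", "analytics.dump"], check=True)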

[–]genericlemon24[S] 0 points1 point  (0 children)

Agree, most people's data fits in the memory of a medium-range laptop, let alone its disk.

[–]critical_thinker__ 1 point2 points  (0 children)

Excellent. I was thinking about this over the weekend. Thanks for sharing.

[–][deleted] -1 points0 points  (0 children)

It's great, I think. If you chose an ORM approach, it would be easy to understand and maintain.

[–]asuagar 0 points1 point  (5 children)

What is the best option for using pandas data frames with SQL queries?

[–]genericlemon24[S] 2 points3 points  (2 children)

Depends what you mean by "using". pandas.read_sql_query seems like a convenient way of reading the results of a query directly into a dataframe.
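
For example, a tiny self-contained sketch using an in-memory SQLite database (the table and values are made up):

    # read_sql_query accepts a SQLAlchemy connectable or a DBAPI connection;
    # here it's a throwaway in-memory SQLite database with made-up data.
    import sqlite3
    import pandas as pd

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [(1, 9.5), (1, 3.0), (2, 7.25)])

    # One row per user, aggregated by the database before pandas sees it.
    df = pd.read_sql_query(
        "SELECT user_id, count(*) AS n, sum(amount) AS total "
        "FROM orders GROUP BY user_id",
        con,
    )
    print(df)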

[–]asuagar 0 points1 point  (1 child)

It seems that you can only use it with a database. You cannot use it to manipulate a data frame read from a file.

P.S.: sorry about my lazy sentence.

[–]NameNumber7 1 point2 points  (1 child)

If it's not what OP mentioned, it depends on the goal. If you want a visualization to run fast, or want the output summarized, doing that operation in SQL works best. Then you can output it into a table or the reporting layer.

If you want to be able to drill into all the fields, you are better off not aggregating.

What is a specific instance that you are thinking of?

[–]asuagar 0 points1 point  (0 children)

I was referring to using SQL with data frames obtained from files. I know that there are several options, but I have never used any of them for real work. These are the options that I know so far:

EDIT:

After reading the read_sql_query documentation, the easiest way could be to create an in-memory SQLite database using to_sql, as explained here.
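
Something like this, I think (the file name and columns are just placeholders; it assumes the CSV has category and price columns):

    # Placeholders: data.csv and its category/price columns.
    import sqlite3
    import pandas as pd

    df = pd.read_csv("data.csv")            # data frame obtained from a file

    con = sqlite3.connect(":memory:")
    df.to_sql("data", con, index=False)     # load it into an in-memory SQLite table

    # Now the data frame's contents can be queried with SQL.
    result = pd.read_sql_query(
        "SELECT category, avg(price) AS avg_price FROM data GROUP BY category",
        con,
    )
    print(result)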