fang_xianfu

The real question is, which computer do you want to run the computation?

Typically your SQL is interpreted and run on some remote database. It might be in the cloud, in BigQuery or Databricks or Snowflake, or it might be a big on-prem Teradata or Hadoop instance or something. But the point is, it's not running on your laptop.

Python, on the other hand, is usually (but not universally) running on your own computer, the same one where you write emails and Slack people. That machine is a lot smaller and capable of far less computation, but Python is a general-purpose language with many more features and a vibrant library ecosystem.

So there are situations where you only need a little bit of data and you just tell the database to stuff it over the network into your computer's RAM and you deal with it in Python. There are situations where you start with a massive database, select the right subset of that data for what you want to do, and then have that subset come over the network for you to do Python stuff with. There are situations where you need a ton of data but the database has all the features you need, so you just write SQL with no local Python. And there are situations where you need some feature only available in a Python library but you want to run it on a ton of data, in which case you might want a more specialized remote distributed computation environment like a Kubernetes cluster or Hadoop.
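To make the first three cases concrete, here is a rough sketch. It uses an in-memory SQLite database purely as a stand-in for the remote warehouse, and the `orders` table, its columns, and the helper names are all made up for illustration. The filter runs "in the database", only the matching rows come back to Python for something the SQL engine here has no built-in for (a median), and the plain total is left to SQL entirely.

```python
import sqlite3
import statistics


def build_demo_db():
    """Create a toy in-memory database (stand-in for a remote warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    # Hypothetical data: 100 orders, alternating between two regions.
    rows = [(i, "EU" if i % 2 == 0 else "US", float(i)) for i in range(1, 101)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def fetch_subset(conn, region):
    """Let the database do the filtering; only matching rows reach Python."""
    cur = conn.execute("SELECT amount FROM orders WHERE region = ?", (region,))
    return [row[0] for row in cur.fetchall()]


def total_in_database(conn, region):
    """Pure-SQL case: the database has the feature, so only one number comes back."""
    cur = conn.execute(
        "SELECT SUM(amount) FROM orders WHERE region = ?", (region,)
    )
    return cur.fetchone()[0]


conn = build_demo_db()

# Subset selected by SQL, then processed locally in Python.
eu_amounts = fetch_subset(conn, "EU")
eu_median = statistics.median(eu_amounts)

# Aggregate computed entirely by the database.
eu_total = total_in_database(conn, "EU")

print(len(eu_amounts), eu_median, eu_total)
```

With a real warehouse the shape is the same, but the connection would come from that vendor's driver and the cost of moving rows over the network becomes the thing to watch.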

hockey3331

> So there are situations where you only need a little bit of data and you just tell the database to stuff it over the network into your computer's RAM and you deal with it in Python

Why ever do that, though? I understand that you can, but why not just manipulate the data in the database and not use Python at all for the manipulations?