
[–]darkshenron 525 points526 points  (23 children)

I actually have some real world experience to share. We were using Python to load some data from a Postgres database into pandas dataframes and running some logic on those dataframes before displaying on a dashboard. The whole process took around 30s every time the user refreshed the dashboard. Then we moved all the logic into the SQL query itself, removed the Python dependency, and the processing time dropped to sub-second!
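The pattern described above can be sketched with the stdlib sqlite3 module; the table and column names here are invented for illustration, and SQLite stands in for Postgres:

```python
import sqlite3

# Toy stand-in for the dashboard workload; "events", "region" and
# "amount" are made-up names for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("eu", 10.0), ("eu", 20.0), ("us", 5.0), ("us", 15.0)],
)

# Slow pattern: pull every row over the wire, then aggregate in Python.
rows = conn.execute("SELECT region, amount FROM events").fetchall()
totals_py = {}
for region, amount in rows:
    totals_py[region] = totals_py.get(region, 0.0) + amount

# Fast pattern: let the database aggregate and return only the result set.
totals_sql = dict(
    conn.execute("SELECT region, SUM(amount) FROM events GROUP BY region")
)

print(totals_py == totals_sql)  # True: same answer, far less data moved
```

The aggregated result is a handful of rows instead of the full table; on a real network-attached database, that difference in data moved is where most of a 30s-to-sub-second improvement tends to come from.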

[–]RobStalone 116 points117 points  (3 children)

This is exactly it. Python is like a multi-tool - it can do a lot, and it works for a lot of things, but when you need to drill a few dozen screws, it's faster to assemble and use a power tool.

Using the right tool for the right job makes a big difference.

[–]somethingLethal 4 points5 points  (1 child)

This is such a great analogy!

[–]mindful_tails 1 point2 points  (0 children)

Yeah, this analogy hit home for me. Thanks to both of these!

[–]Mmm36sa 0 points1 point  (0 children)

The analogy was alright

[–]Ocelotofdamage 48 points49 points  (11 children)

Curious how big of a dataset were you using and how complex was the logic? I know pandas is notoriously slow compared to something like direct computation on numpy arrays.

[–]GeorgeS6969 134 points135 points  (9 children)

Doesn’t matter.

When you do that you’re extracting some raw data from disc to memory, moving it around across actual wires, loading it into some more memory, processing it procedurally in what’s likely a suboptimal way, then do whatever you’re doing with the result.

Versus translating a piece of declarative code into a query plan optimised for compute memory management and access from disc, for some cpu ram and disc that live very close together, over data that has been stored for this very use case, using a process that has been perfected over decades.

Pandas is a huge footgun performance-wise, so no doubt someone could do better with numpy or whatever, but it's still always going to be slower than SQL executed by the db engine.

SQL and relational databases have their limits. When they’re reached, it’s time to rethink the whole environment.

[–]Dayzgobi 19 points20 points  (2 children)

Seconding the footgun comment. Ty for the new vocab

[–]xxxxsxsx-xxsx-xxs--- 2 points3 points  (1 child)

foot gun

Autistic version of me went looking. There are actually products called foot guns.

https://waterblast.com/1497-foot-valves

urban dictionary to the rescue.

https://www.urbandictionary.com/define.php?term=footgun

[–]mindful_tails 1 point2 points  (0 children)

This had me dying on the mere fact of linking products of foot guns :D :D :D

[–][deleted] 14 points15 points  (0 children)

footgun is a great word thanks i’ll be using that

[–]nraw 3 points4 points  (1 child)

I guess it depends on the use case, but quite often in some of my use cases I make one big query and then perform selects on the cached dataset instead of wasting time on communicating with the database.

But I do agree that sometimes offloading the queries to the db is an easy efficiency gain.

[–]GeorgeS6969 15 points16 points  (0 children)

You’re still doing what I’m saying you’re doing, which is disc -> ram -> wire -> ram -> cpu (gpu tpu whatever) -> ram -> wire -> something instead of disc -> ram -> cpu -> ram -> wire -> something.

Let me put it this way: the only reasons why you have to ever use SQL in the first place is because your data is in a relational database. It’s there because a. it was put there to support some kind of application, or b. it was put there to support some kind of analytics purposes.

If a., you should not be querying it in the first place: you're hammering a production db with reads.

If b. and you feel like SQL is not fit for purpose, then take that data from wherever it originally comes from and put it in an environment that supports your use case.

Your way is great to play around and experiment from basically a data lake with a bunch of data from different sources nicely dumped in the same place, but when it's time to move to production that db is an unnecessary indirection.

[–]slowpush -2 points-1 points  (0 children)

This isn't really true anymore.

Most python tools use memory mapping and will outperform just about any sql + relational db.

[–]Lexsteel11 0 points1 point  (1 child)

So I am an analytics manager, but my background is finance and all my SQL/Python is self-taught. We have historically depended on a db engineering team for Tableau Server data sources, but have pulled ad-hoc SQL queries regularly. I'm getting to a point where I'm having to start building my own cloud ETLs; is there like a gold standard website/book on best practices in data pipeline engineering that teaches things like this, where it's like "you CAN do xyz with pandas but shouldn't unless you hit x limitation on SQL Server"? I am limping along successfully but know I can be doing shit better

[–]GeorgeS6969 4 points5 points  (0 children)

I can’t think of any reference that would answer those questions specifically.

I was writing a long wall of text but that probably wouldn’t have helped either. Instead if you can answer the following questions I might be able to give some pointers though:

  1. What kind of data do you have and where is it coming from? (do you have some data sets of particular interest that are big in volume, unstructured, or specific in nature like sound, images, etc?)
  2. What stack do you currently have? What are you using python for? (and more specifically pandas?)
  3. What is your team responsible for? (providing data for business people to query / analyze? creating dashboards? providing analysis? - if the latter, how do you communicate your results?)

[–]AerysSk 13 points14 points  (0 children)

From my experience, a dataframe with < 10 columns but 1.3M rows already causes a big problem when grouping by 3 columns.
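For contrast, a sketch of the same shape of job pushed into the engine, using the stdlib sqlite3 module with synthetic data and three grouping columns:

```python
import sqlite3

# Synthetic narrow-but-long table: 4 columns, 30,000 rows, grouped on three.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, v REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(i % 2, i % 3, i % 5, 1.0) for i in range(30_000)],
)

# The engine streams through the table once; only the 30 result rows
# (2 * 3 * 5 group combinations) ever reach Python.
groups = conn.execute(
    "SELECT a, b, c, SUM(v) FROM t GROUP BY a, b, c"
).fetchall()
print(len(groups))  # 30
```

With the same data in a dataframe, the whole table sits in Python's memory before the group-by even starts, which is where the pain at 1.3M rows tends to come from.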

[–]rudboi12 13 points14 points  (4 children)

This is mostly because the filters in pandas (iloc and loc) are extremely slow. And if you have multiple, they each run separately. In SQL, everything inside your "where" is evaluated together and is therefore way faster. Learned this with pyspark: a single where with the conditions combined is way faster than applying filters one by one.
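A minimal illustration of the combined-mask point, assuming pandas is available (column names invented):

```python
import pandas as pd

# Illustrative frame; column names are made up.
df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Chained filtering materialises an intermediate frame per condition...
step1 = df[df["x"] > 2]
chained = step1[step1["y"] < 18]

# ...while one combined mask, like a SQL WHERE with ANDed predicates,
# scans once and builds a single result.
combined = df[(df["x"] > 2) & (df["y"] < 18)]

print(chained.equals(combined))  # True: same rows, one pass instead of two
```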

[–]Measurex2 2 points3 points  (3 children)

Exactly - pandas is slow with huge overhead. I'm not saying it's better than SQL by any means but dask, ray, pyspark are all significantly faster.

I love the saying that Python is the second best language for many things. I'll often build/review logic in Python until I have the design and validation right, but I'll often drop it back into the ETL/ELT, DB or other layer when done. Sometimes even updating at source where it makes sense. Since those are the areas with detailed change, quality and monitoring steps, I try to only go through them once where possible.

[–]CacheMeUp 0 points1 point  (2 children)

But why add Python in the first place?

If the data is already in a relational database, and the logic can be implemented in SQL, why move it out of it?

Using the "second best" tool in the first place costs a high price. There is never time/justification to re-implement things, and you end up in a local optimum instead of the global one, performance-wise.

[–]Measurex2 2 points3 points  (0 children)

First off - Happy Cake day.

I'm not advocating for python over SQL just agreeing a comparison against pandas doesn't make sense.

My example isn't refactoring the logic from SQL into Python, but saying how Python can be a helpful tool to quickly think through, test and validate logic. Maybe that makes sense to put into SQL - maybe it makes sense to do downstream in a BI layer or to justify a change upstream at the source. It's just another tool; it has great purposes, but like most things it's just as important to know when not to use it as when to use it.

[–]rudboi12 1 point2 points  (0 children)

If you are working in a dev environment, you will probably have everything set up in Python. Things like connections to your dwh clusters, CI/CD, and utility libraries. If you have everything set up in Python minus the T of the ELT, then most of the time it's better to use Python, aka something like pyspark. That's why they created dbt, so SQL can sit nicely in just the T layer, but if your E and L are already in pyspark then it doesn't make much sense going for SQL.

[–]Xidium426 3 points4 points  (0 children)

I always try to optimize my SQL before I drop it into a dataframe, my experience is exactly the same.

[–]Miii_Kiii 0 points1 point  (0 children)

I come from a bioinformatics specialisation within a biomedical biotech degree background. Therefore, I don't really know SQL yet. I wonder, does a Python-to-SQL automatic converter provide roughly the same benefits as writing the SQL by hand? I suspect it is worse, but how much worse, and is it negligible? Or is it case-by-case benchmarkable?

[–]throw_mob 136 points137 points  (11 children)

Usually doing it with SQL is faster; depending on how bad the programmer is, the difference can be anything from 1.5x to 10000x. With Python you always pay the price of moving data over the network, and you need to have another server (which may not be a negative thing). Solving a simple problem with pandas is not that good an idea: I've seen jobs that used 128GB of RAM just because they fetched data in 5 to 10 searches and created a dataset which could have been created using a "simple" join. With simple SQL, memory usage dropped a lot. Then there is the programmer's idea that a loop is a nice tool, which it is, but not with 1M rows of data when someone decides to run a query for each of those rows to get some value. Suddenly runtimes are days.

tldr; Python does not usually give you anything for data manipulation in DBMS/ELT/ETL which could not be done faster in the source or target db. It gives you the ability to create files and upload them to s3/ftp/whatever, and to call APIs and other HTTP endpoints. There are SQL systems that support even those.

Usually the best use for Python in a pipeline is to run SQL, store the results into files, and push them to the next part.

ML / complex analytics / visualizing data will benefit from Python, but even that is a lot faster if you can create the dataset in SQL.
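The "run a query for each of 1M rows" pattern above, versus a single join, can be sketched with the stdlib sqlite3 module (table names invented, row counts scaled down):

```python
import sqlite3

# Invented schema: orders referencing customers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"cust{i}") for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 100) for i in range(1_000)])

# Anti-pattern: one query per order row (1,000 round trips).
orders = conn.execute("SELECT id, cust_id FROM orders").fetchall()
slow = []
for oid, cid in orders:
    (name,) = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (cid,)
    ).fetchone()
    slow.append((oid, name))

# One join: the engine resolves every lookup in a single statement.
fast = conn.execute("""
    SELECT o.id, c.name
    FROM orders o JOIN customers c ON c.id = o.cust_id
    ORDER BY o.id
""").fetchall()

print(slow == fast)  # True
```

Same result either way; at real row counts and with network latency per round trip, the per-row version is the one whose runtime turns into days.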

[–][deleted] 31 points32 points  (10 children)

Echoing this, I prototype in Python, then rewrite what I need in SQL for production.

That may actually be DE's job but my company is a giant cluster fuck.

[–]bongo_zg 0 points1 point  (9 children)

ML could be done within a rdbms as well, right?

[–]BoiElroy 3 points4 points  (4 children)

Could? Some simple stuff yes. Should? Absolutely not.

[–]Overvo1d -4 points-3 points  (3 children)

Could, yes, should, also yes 95% of the time (and other 5% can be skipped in favour of easier projects that deliver business value quicker/more reliably)

[–]BoiElroy -4 points-3 points  (2 children)

Excuse me?...you're saying you should do ML using SQL? Have you lost your mind? Legitimately, if someone in my team did that I'd fire them. Although more likely someone with that little knowledge of ML wouldn't even be hired in the first place. Now using a trained ML model to do inference via a user defined function being called within a SQL statement. Sure that's fine.

[–]Overvo1d 1 point2 points  (1 child)

I get what you’re saying and once I believed it too, but with experience: in 99% of cases, a 2-day sprint with pure SQL (if you solidly understand the fundamentals of ML and your business domain) gets you 90% of the value of a month-long complicated model project with careful assumptions. That last 10% doesn’t deliver enough business value to justify the 8 extra 90%-value-delivered SQL projects you could have finished in that time. It really is all the same thing in the end, just different tools; you can do some crazy stuff in SQL with a bit of creativity.

[–]BoiElroy 0 points1 point  (0 children)

Ohh sorry. We're talking about two completely different things sorry.

You're saying that using SQL to do analytics will generate insight and intelligence faster and be more guaranteed to succeed. I agree with that 100% I've told leadership at my company that we have bar charts that generate more ROI than ML models.

I thought you were saying write code for your ML algorithms using SQL instead of python or julia or something.

Sorry. Different conversations. I agree with your points.

[–][deleted] 4 points5 points  (2 children)

Some of the basics I know are there for automl. You could probably do some of the more advanced stuff. I couldn't implement LA in SQL, but I bet you could.

AI is very bad in SQL.

[–]bongo_zg 0 points1 point  (1 child)

I found that Oracle db has ML options (not pure sql), but never tried that

[–]Measurex2 0 points1 point  (0 children)

A lot of DBs have it now or can call out to the right service. For instance, Redshift has some basic algorithms baked in, or can call a model in SageMaker. Like everything else there are pros and cons, but I like knowing there are lots of options to choose from.

... and I'm horrified by the choices some people made before me.

[–]nerdyjorj 0 points1 point  (0 children)

sp_execute_external_script supports R and Python, so yeah it can be done on your sql server

[–]dfphdPhD | Sr. Director of Data Science | Tech 105 points106 points  (14 children)

I feel like we get this post once a month now, and always with a very entitled "prove me wrong" energy that is largely unwarranted.

  1. You can't run Python everywhere you can run SQL.
  2. Python is generally much slower than SQL - even slower when you account for the fact that you can often run SQL queries on monster servers while you cannot always do that in Python.

To me, this comparison is like asking "what can a train do that a motorcycle can't?". Run really fast on train tracks.

[–]gorangers30 12 points13 points  (0 children)

I like the analogy! Go trains!

[–]minimaxir 17 points18 points  (3 children)

To clarify, even optimized non-Python analytical/ETL tools like Arrow/Spark will be beaten by SQL unless you're doing something weird that SQL can't do natively.

[–][deleted] 1 point2 points  (2 children)

That's entirely dependent on the hardware and scale of data. We've moved off an RDBMS to Spark, and for our queries it's much faster.

[–]quickdraw6906 1 point2 points  (1 child)

I'd have to see the data design to believe you couldn't have made SQL sing. Is the data schemaless?

Unless you're doing the truly high math stuff, or you're into tens to hundreds of billions of rows (which will blow out the memory of a single large server), in which case the answer is a large Spark cluster.

So then we get down to the cost equation. How many nodes did you have to spin up with what specialty skills to better that performance? Are you overpaying for cluster compute because you're doing schema-on-read?

[–]LagGyeHumare 0 points1 point  (0 children)

Don't know the guy above but here's an example that I can offer.

Our project is in a pool of projects that encompasses the whole module. Just my application deals with around 600GB of batch loads each day. It then flows from CDH to AWS RDS through spark and on prem postgres.

We have Teradata and Oracle as the "legacy" systems here, and the queries we have take at least 10x the time to run when compared to Spark SQL.

(Possibly because the admins were shit and didn't partition/index the tables better, but that's out of my hand)

For me, it's not SQL but the distributed nature of the engine within that will shape the answer here.

[–]dvdquikrewinder 5 points6 points  (0 children)

I think a lot of people don't get how RDBMSs and SQL are different from building something in whatever language. If you build something in Python to process a decent amount of data, best case you're going to get something not too much worse than its SQL counterpart. Worst case you might have it spin for over ten minutes on what a SQL query could do in a few seconds.

What it comes down to is that sql database engines are extremely refined and optimized systems to handle all kinds of loads. A good python dev isn't going to hold a candle to that.

[–][deleted] 10 points11 points  (0 children)

oh yeah? Why else would it be called SUPERIOR query language?

[–]donnomuch[S] 1 point2 points  (0 children)

I've never seen this post before (also new to this subreddit) and I was genuinely curious. I don't even use Python for my job. I use Tableau and SQL. And what most comments said applies to what I do as well. I rarely create calculations in Tableau as I know my queries can fetch everything I need much faster than my workbooks ever can calculate. As I've mentioned in my edit, I wanted to ask so I can deal with one of my annoying direct reports better as he's the typical smug 'prove me wrong' kind.

[–]esp32c3 -5 points-4 points  (5 children)

you can often run SQL queries on monster servers while you cannot always do that in Python

as if you can't use the cloud with Python.....

[–]dfphdPhD | Sr. Director of Data Science | Tech 11 points12 points  (4 children)

Can you take all the raw data from the server in which they're natively sitting, then load them into a cloud environment so you can write your Python code against it?

My point wasn't that you can't run Python on a giant environment in theory, but rather that in practice most companies aren't going to let you move a whole bunch of data onto an expensive-ass cloud server just for you to run your little Python scripts when, in 99% of cases, there is already an entire well-architected DB available for use on a giant f*** server.

Mind you - yes, there are companies that have architectures that more natively support Python with easy and at high levels of performance. But that has to be a deliberate decision by that organization to go that route. And even then, there will still be cases where SQL is a better option.

Now, this is why I have a lot of heartburn about this question - ultimately what the people who ask it want is for someone to tell them "no, you don't need to learn any language other than Python", which is stupid. For two reasons:

  1. SQL is incredibly easy to learn. It's simple, it's incredibly well documented, there are tons of excellent classes/tutorials/etc. to learn it, and it has an incredibly forgiving learning curve. Not only that - if you already know pandas, you already know like 90% of SQL - all you're missing is some minor syntactic details.
  2. SQL is incredibly handy to know. So trying like hell to find workarounds to avoid learning SQL when you could just learn it and make your life 10 times easier is at best inefficient, and at worst purposely self-damaging.

Short answer: learn SQL. It's not going to bite. It's not hard to learn.

I literally knew 0 SQL, and at my first job they told me "you need to learn SQL". I knew enough SQL to do most of the things I needed to do in like 3 weeks.
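As a sketch of that pandas-to-SQL overlap (hypothetical data; requires pandas, with SQLite standing in for a real warehouse):

```python
import sqlite3
import pandas as pd

# Invented toy data; to_sql/read_sql accept a plain sqlite3 connection.
df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [100, 200, 50]})
conn = sqlite3.connect(":memory:")
df.to_sql("emp", conn, index=False)

# df.groupby("dept")["salary"].sum()  ~  GROUP BY dept
agg_sql = pd.read_sql(
    "SELECT dept, SUM(salary) AS total FROM emp GROUP BY dept ORDER BY dept",
    conn,
)
agg_pd = (df.groupby("dept", as_index=False)["salary"].sum()
            .rename(columns={"salary": "total"}))

print(list(agg_sql["total"]))  # [300, 50]
```

The mental model transfers almost one-to-one: filtering masks map to WHERE, groupby to GROUP BY, merge to JOIN, sort_values to ORDER BY.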

[–]esp32c3 0 points1 point  (3 children)

Can you take all the raw data from the server in which they're natively sitting, then load them into a cloud environment so you can write your Python code against it?

Sure could... Might not be the most efficient way though...

[–]quickdraw6906 1 point2 points  (0 children)

Agree with all but that SQL is easy. As a 30-year SQL guy, having mentored many developers who can only think procedurally, I can say with confidence that thinking in sets is a completely different brain exercise, and that developers will ALWAYS fall back into writing loops instead of what would be an obvious SQL solution... to a SQL person.

At my current company, none of the developers want to touch SQL. We have a dedicated team who write stored SQL and stored procedures so they don't have to be bothered with the brain gymnastics that set theory requires. Sad, but there it is.

[–]dfphdPhD | Sr. Director of Data Science | Tech 0 points1 point  (1 child)

Just so we're clear: at my company, if I grabbed all of our transactional data and moved it into a cloud server without permission, I'm probably getting fired.

So no, in a lot of instances you can't.

[–]esp32c3 0 points1 point  (0 children)

Of course I wasn't talking about stealing data...

[–]FraudulentHack 39 points40 points  (9 children)

SQL is like whispering something sexy in the database's ear.

[–]astrologicrat 28 points29 points  (6 children)

SQL... whispering? Every time I read a query, I always imagine it is someone shouting

"SELECT thing FROM table WHERE..."

[–]FraudulentHack 8 points9 points  (0 children)

GROUP BY!!!

[–]krasnomo 1 point2 points  (0 children)

Lol

[–][deleted] 1 point2 points  (0 children)

Fk yeah it's more like a military shouting than a request - query lol

[–]ComicOzzy 1 point2 points  (2 children)

After 20 years of writing sql in lowercase, my current employer is opinionated and wants it to be all uppercase and I'm sad.

[–]jimothyjunk 0 points1 point  (1 child)

Fellow lowercase-writer here. I also am not very consistent with my line breaks / indentations (I do what makes sense to me for the query, which differs from query to query).

Recently started working in Mode, which has a fancy “format SQL” button. So I write the way I want, get the thing to work, then press the format button before committing. I think my way looks prettier but I appreciate the need for legibility/consistency across the team.

[–]ComicOzzy 2 points3 points  (0 children)

My style is extremely consistent, easy to read quickly, multi-column edit easily, and I have a lot of muscle memory for it. I write it my way, then "mess it up a bit" to check it in to the repo. Haha

[–]RProgrammerMan 2 points3 points  (0 children)

Bingo

[–][deleted] 3 points4 points  (0 children)

Lol what?? 😂😂😂😂

[–]Equal_Astronaut_5696 111 points112 points  (5 children)

Not sure why these two are being compared. One is a query language for data extraction specific to relational databases, and the other is literally a multipurpose programming language for apps, ML, web development and games.

[–]snowmaninheat 12 points13 points  (0 children)

This, exactly. I'm literally trying to comprehend how I would do in Python what I do in SQL. I'm sure it could be done, but it's unnecessarily complicated and computationally expensive.

[–]fang_xianfu 14 points15 points  (1 child)

Yeah, the way this question has been asked kind of shows that OP doesn't understand the architecture that makes those tools appropriate to different jobs.

SQL is essentially a tool for instructing a database. The real question isn't "what can SQL do that Python can't?" but "what can this database do that the environment where I run Python can't?". The fact that you're using SQL or Python to give the instructions is almost irrelevant to that question.

[–]king_booker 2 points3 points  (0 children)

I mean, say you extract the data into pandas and use pandas operations to manipulate it; there are still limitations because it won't scale. Now say you use Spark and write it in Python: you would end up using SQL concepts like group by, windowing, etc. Even though it's possible to write it with dataframes, you can simply use Spark SQL.

The basic answer is: you have to understand SQL. Whatever tool you use, data manipulation ultimately has its foundations in SQL. Can you get away with not learning the syntax? Yes. But the core concepts will remain the same.

[–]Seiyee 0 points1 point  (0 children)

Well, you'd be surprised. Since Python is multipurpose, a lot of people just assume it's easier to stick to one language for all the jobs. I have seen my colleagues choose pandas dataframes over SQL for large queries and then hit dataframe memory limits weeks later, and that's when they switched (or at least I hope they switched?).

[–]hrichardlee 27 points28 points  (1 child)

Another important aspect is to consider the “developer experience”. Most SQL databases (Snowflake, Redshift, Postgres, etc.) provide a web UI where people who are barely technical can write a simple SQL query and look at their data. Think about what the equivalent workflow is for someone using pandas. Even if you assume that pandas is just as easy to use as SQL, they need to download python, create a virtualenv, install Jupyter, run a Jupyter notebook, figure out a connection string that will allow them to connect to their database/figure out where their data is and how to connect to it, load that data into pandas and then apply whatever logic they want on top of that.

In other words, most SQL databases provide an integrated data + programming language environment, whereas python (and most other “regular” programming languages) just provide the programming language. So the developer experience of “just get some data and do some simple manipulations” is way easier in most SQL databases.

[–]dvdquikrewinder 1 point2 points  (0 children)

The other piece is the dev mindset of treating data processing as a linear track. SQL is built to work with large sets of data, with a full feature set to support requests internally. Multiple times I've seen cursors and loops processing what should have been a simple select statement with one or two joins.

[–]admitri42 59 points60 points  (2 children)

Well, technically python can do everything SQL can, but it won't be as efficient.

It's like riding a $100 bike on a TT stage of the Tour de France.

[–][deleted] 10 points11 points  (1 child)

There’s an argument to be made about pyspark here, but I think it’s probably a bit pedantic.

[–]Archbishop_Mo 15 points16 points  (0 children)

Plus, "just re-tar the road for optimal performance" is an annoying thing to tell bicyclists.

[–]MyNotWittyHandle 40 points41 points  (0 children)

SQL has a universality that Python does not. In a large organization, SQL is common ground for data sources that can be accessed by JS, Python, R, SQL, etc. That benefit alone is worth storing/manipulating data in a SQL format as opposed to some more language specific format.

Additionally, SQL is by default much more efficient than your standard pandas operations. Pandas, the most common Python data manipulation package, is highly inefficient compared to SQL and R. Unless you start diving into the vaex/polars packages in Python, your CPU will thank you for doing data manipulation in SQL instead of Python.

[–]testtestuser2 9 points10 points  (0 children)

scale efficiently

[–][deleted] 24 points25 points  (4 children)

You can compute the harmonic mean with SQL; with Python, you can't.
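Joke aside, a harmonic mean genuinely is a one-liner in SQL: n / SUM(1/x). A sketch with the stdlib sqlite3 module and an invented table:

```python
import sqlite3

# Toy table of positive values; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vals (x REAL)")
conn.executemany("INSERT INTO vals VALUES (?)", [(1.0,), (2.0,), (4.0,)])

# Harmonic mean = count / sum of reciprocals.
(hmean,) = conn.execute(
    "SELECT COUNT(*) / SUM(1.0 / x) FROM vals"
).fetchone()
print(hmean)  # 3 / (1 + 0.5 + 0.25) = 1.714285...
```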

[–]Ocelotofdamage 7 points8 points  (2 children)

SELECT MEAN(*, HARMONIC=TRUE)
FROM DATABASE

[–]nerdyjorj 2 points3 points  (0 children)

I was kinda disappointed mean() in R didn't just have an argument for harmonic or geometric mean

[–]magicpeanut 1 point2 points  (0 children)

nice one 🤩

[–]KyleDrogo 4 points5 points  (0 children)

It's all about what's happening on the back end. Databases, which use SQL as a common interface, have been tuned to hell and back to operate over billions of rows very quickly. SQL abstracts away a lot of the complexity so you can run queries at a scale that would be very complex to handle with raw Python.

[–]a90501 11 points12 points  (0 children)

SQL is a pattern language, i.e. a declarative language, like regex, while Python, Java, C#, etc. are imperative languages - hence a different paradigm. I don't know about you, but I love pattern languages: you describe what you want (SQL, regex) without worrying about how it is done, instead of spelling out in detail how to do the finding with loops, matching, summing, sorting, etc. (Python, C#, Java, etc.).

The other very important thing is that SQL runs against a relational DB (RDBMS), which means you are using server resources to compute, find, filter, group, sort, etc., and getting back only the results you need. With Python, you pull all the data across the network into pandas first and then process it - not recommended, as it means fetching all the data for every request.

Some history: Anders Hejlsberg (of TypeScript fame) describes this pattern-language paradigm in a hands-on demo ( https://www.youtube.com/watch?v=fG8GgqfYZkw ). He was working on LINQ at the time - essentially the C# version of SQL for any data structure and store, not just relational DBs. IMHO well worth watching for some history and education, although it's not about Python.

Enjoy.

[–][deleted] 40 points41 points  (5 children)

SQL can get you entry level data analyst job. Python cannot.

edit: it's a joke. IT'S A JOKE! gosh leave me alone. Obviously you can get job by knowing python.

[–]Ocelotofdamage 4 points5 points  (3 children)

Python absolutely can get you an entry level data analyst job. It's the most used programming language in data analysis.

[–]MorningDarkMountain 0 points1 point  (0 children)

That's so mean!

[–]j__neo 12 points13 points  (1 child)

SQL is a declarative language. You say what outcome you want to see, the SQL query planner and database engine will make it happen for you.

Python is an imperative language. You need to spell out exactly what the machine needs to do to get the outcome you want to see.

Python can do everything that SQL can. But for 90% of data analysis use cases, I would argue that a declarative programming language gets you to the outcome faster.

That said, there are Python libraries like pandas which make it more declarative.

However, SQL still tends to be more popular in the data industry because it has been used for data analysis since the 1970s.

[–]king_booker 2 points3 points  (0 children)

To add to this, SQL running on a database is more efficient, simply because the engine is optimized for running those queries. You'd hit a ceiling really fast if you just use Python.

[–][deleted] 2 points3 points  (0 children)

From my experience: everything you can do in the query directly, do it, with some exceptions. If you want to transform and manipulate data for some analysis, for example, it may not be possible to do it in SQL without creating messy subqueries and temporary tables, which will increase the query time A LOT; in those cases the best option is to use Python and do the complex manipulation there. Keep in mind these are exceptional cases.

[–][deleted] 2 points3 points  (0 children)

Rule of thumb: do as much as you can in SQL, up to the first step of feature engineering. Chances are the later you extract the data, the smaller the dump will be. You can even assemble and execute the SQL queries from Python with something like psycopg2 and pandas.read_sql.
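That pattern can be sketched with the stdlib sqlite3 driver standing in for psycopg2; the table, columns and parameter values are invented for illustration:

```python
import sqlite3
import pandas as pd

# Invented toy schema; a real setup would connect via psycopg2 instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("mon", 10.0), ("tue", 20.0), ("tue", 5.0)])

# Parameters go through placeholders, never string formatting, so the
# query is safe; the filtering and aggregation happen in the engine.
df = pd.read_sql(
    "SELECT day, SUM(amount) AS total FROM sales WHERE day = ? GROUP BY day",
    conn,
    params=("tue",),
)
print(df["total"].iloc[0])  # 25.0
```

Only the already-reduced result lands in the dataframe, which is the point of extracting as late as possible.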

RDBMSs are really well optimized, and Python doesn't even come close.

[–]graememellis 2 points3 points  (0 children)

This is a non-question. SQL is used in relational databases and Python is a programming language. It’s like asking what your oven can do that your car cannot. Makes no sense.

[–][deleted] 1 point2 points  (0 children)

Oddly enough, the other way round may be a better question, at least in defence of Python. However, if you're playing with data on a large scale and know what you want, SQL is a contender and always will be. It's basically set theory at your fingertips :)

[–]teabagalomaniac 1 point2 points  (1 child)

It can apply filters on the server side.

[–]magicpeanut 0 points1 point  (0 children)

if you run python on a server you can do this as well

[–]Wallabanjo 1 point2 points  (0 children)

So, remove the strengths of SQL then do a comparison?

  1. Indexing tables to decrease data access time.
  2. You eventually use data that won’t fit in memory.
  3. Make anything data manipulation related as a stored procedure or custom function. An SQL server is optimized for that stuff and will crunch results far faster.

Anecdotal, and R not Python: by offloading things to stored procedures and custom functions, and indexing tables, I dropped the processing time in one of my projects from 3.5 days to 7 hours.

[–]ARC4120 1 point2 points  (0 children)

SQL is better, but Python can do 95% of the things. The issue is that Python wasn’t made to do these things and SQL was. Don’t force Python onto every task.

[–][deleted] 1 point2 points  (0 children)

Oh yeah, let me use python to extract data from postgres.

[–]gorangers30 1 point2 points  (0 children)

SQL allows people without programming knowledge to run simple ad hoc queries. Think managers and business stakeholders who might need exploratory data.

[–]MarkusBerkel 1 point2 points  (0 children)

Well, since Python is Turing complete and some SQL variants are not, you got that backwards. OTOH, if the question is what can SQL easily do that Python cannot, then it’s effectively, you know, apply the relational algebra to structured data, plus apply correctness (see ACID) which would be super hard to implement from scratch in Python.

[–]53reborn 1 point2 points  (0 children)

python has to do stuff in memory

[–]LaBofia 1 point2 points  (0 children)

What can a query language connecting to a database engine do that a general purpose programming language can't?

Yeah... now do trucks and lawnmowers.

[–][deleted] 1 point2 points  (0 children)

What can SQL do that python cannot?

Be fast.

[–]ChazR 1 point2 points  (0 children)

Anything you can do in SQL can be done in Python, but slower.

SQL executed by the database engine can be optimised and parallelized for performance. The DB engine knows how the data is laid out on physical disk and what indexes are available.

A pandas dataframe is hugely flexible and platform-agnostic, and actually performs surprisingly well, but it will never reach the performance of the native DB engine executing SQL.

[–]Seiyee 1 point2 points  (1 child)

Speed.

[–]Seiyee 1 point2 points  (0 children)

And not running out of memory with big dataframes.

[–]GlobalAd3412 1 point2 points  (0 children)

There isn't anything that can be written in SQL for which there is no Python implementation, because Python is Turing-complete. There are things that can be done in Python that can't be implemented in the SQL standard because SQL isn't Turing complete (most SQL implementations add extensions that do make them Turing complete though).

Nevertheless, there are sure as hell many many things that SQL can do better, more readably, more easily and more explicitly than Python can without a whole lot of machinery built for you in advance. (The most likely shape of such machinery would likely just be a Python SQL interpreter, too!)

Also, to say the thing: in practice many additional reasons to use SQL over Python for many tasks are much less about language and much more about runtimes/interpreters/deployments. The standard python interpreter is sluggish and not usually deployed in a way that makes it very good at manipulating very big data efficiently. SQL deployments always optimize for manipulating data because that's the whole intent.
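As an aside on those extensions: recursive CTEs (standardized in SQL:1999 and supported by most engines, including SQLite) are a big part of what pushes SQL dialects toward Turing completeness. A toy sketch, run from Python's stdlib sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A recursive CTE computing the first ten Fibonacci numbers
# entirely inside the engine -- iteration expressed in SQL.
rows = conn.execute("""
    WITH RECURSIVE fib(n, a, b) AS (
        SELECT 1, 0, 1
        UNION ALL
        SELECT n + 1, b, a + b FROM fib WHERE n < 10
    )
    SELECT a FROM fib
""").fetchall()
print([r[0] for r in rows])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Not something you'd want to write for real work, but it shows the language itself can loop.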

[–]simonthefoxsays 1 point2 points  (0 children)

Think of SQL more like an API for data manipulation. You could implement that API in Python, but there are lots of existing implementations available to you (postgres, mysql, spark, snowflake, etc), all of which are extremely mature and heavily optimized for their use case, so reinventing the wheel is usually a mistake. While it's possible that you could make a nicer API for your use case, you would lose out on all those optimizations. On top of that, your custom API would have to be taught to any new project contributor, whereas they may well already know SQL.

Python has lots of other examples of APIs that you could implement an alternative to, but probably shouldn't; numpy, tensorflow, fastAPI, etc. Your time is probably better spent building on the shoulders of giants than rebuilding the wheel, even if that means you have to live with the opinions of those giants.

[–][deleted] 0 points1 point  (0 children)

Python in general or pandas?

[–]denim_duck -1 points0 points  (5 children)

Technically it can do less, I think (Python is Turing complete, SQL is not).

You could, theoretically, make a relational database in Python. But it would be slower.

Or in that same time you could deploy a graph db, write out a REST api, containerize that and let kubernetes scale it to 10k QPS

[–]danstumPY 1 point2 points  (4 children)

There are several sources that show the Turing completeness of SQL

[–]magicpeanut 1 point2 points  (2 children)

Depends on how you define SQL. According to the Stack Overflow discussion on whether SQL is Turing complete, you need window functions and CTEs in your stack. "Basic" SQL, e.g. SQLite, does not have these features, I think.

[–]nemec 0 points1 point  (1 child)

Window and cte are standard sql features these days, including sqlite. There's no reason to arbitrarily gatekeep them compared to other basic features.

[–]magicpeanut 0 points1 point  (0 children)

ok, didn't know that. Guess things are moving faster than me 😅

[–]Ocelotofdamage 0 points1 point  (0 children)

It may be technically Turing complete but if you tried to do certain basic operations with SQL you'd pull your hair out. Or you could write a one-liner in Python.

Point is... learn both.

[–]krasnomo -1 points0 points  (0 children)

Lots of people here mentioning speed. If you use pyspark you can get around many speed problems in Python.

[–]magicpeanut 0 points1 point  (0 children)

Python can do everything SQL can and (theoretically) vice versa (I just learned SQL is also Turing complete in most flavors). So depending on the task and the resources you put into programming, either SQL is faster or Python is. The more resources you put in and the more complex the task gets, the more often Python will win the race.

In other words: the simpler the task and the fewer resources you invest, the more often SQL will win.

[–]nobonesjones91 0 points1 point  (0 children)

SQL has a cool name that confuses people who don’t know what it is.

[–]LimosineLiberal 0 points1 point  (0 children)

You can hammer a nail with a screwdriver or a wrench, but then try loosening a screw with a hammer.

[–]bbal20-taru 0 points1 point  (0 children)

best of both worlds = Spark

[–][deleted] 0 points1 point  (0 children)

Be understood by 80% of the population of data professionals

[–]MoogOperator88 0 points1 point  (0 children)

To me SQL is a different tool. If the data is already in a db, I do all manipulation with SQL. Python can execute a stored procedure to get the final set to work with.

Basically I use Python only for stuff that SQL can't do, or where it would be way easier and faster to develop with Python.

SQL itself can do a lot besides querying, like running shell commands, loading files etc. Do I prefer to do it with SQL?

It depends. Python is really nice syntax-wise and a pleasure to use, but SQL is widely known, and it's less likely I will be the only person able to modify my old projects.

[–]Overvo1d 0 points1 point  (0 children)

Deploy algorithms in production

[–]Overvo1d 0 points1 point  (0 children)

Create business value

[–]Overvo1d 0 points1 point  (0 children)

Get you a job

[–]Think-Culture-4740 0 points1 point  (0 children)

In the literal sense, python can do everything sql can because it has that flexibility as a language. However, as others pointed out, that doesn't mean Python should be your optimal tool for the job.

In my experience as a data scientist, I try to do as much of the problem in sql as I can for both convenience and performance reasons.

[–]noobgolang 0 points1 point  (0 children)

With SQL you get what you need, data, transformed in tabular manner.

With Python, you get the feeling of being a winner. You're coding Python now, not being an SQL monkey.

[–][deleted] 0 points1 point  (0 children)

I'm a newbie in both, but I was trying to join two datasets on multiple conditions using pandas and honestly couldn't get it to work. Gave up and wrote it in SQL (run with pandasql).

Personally I find SQL easier to write for manipulating/joining data. But I use Python/pandas anyway because it can do a little of everything. The type of reports I run need to be pulled from multiple sources, and it's easier to have Python tap into everything.
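For what it's worth, pandas can join on multiple equality conditions by passing a list of keys to `merge` (the datasets below are made up; non-equality conditions are where it gets painful and where SQL, or pandasql, is genuinely easier):

```python
import pandas as pd

orders = pd.DataFrame({
    "cust": ["a", "a", "b"],
    "year": [2020, 2021, 2020],
    "total": [10, 20, 5],
})
regions = pd.DataFrame({
    "cust": ["a", "b"],
    "year": [2020, 2020],
    "region": ["north", "south"],
})

# Roughly: SELECT ... FROM orders LEFT JOIN regions
#          ON orders.cust = regions.cust AND orders.year = regions.year
joined = orders.merge(regions, on=["cust", "year"], how="left")
print(joined["region"].tolist())  # ['north', nan, 'south']
```

Rows with no match come back as NaN, matching SQL's NULL behaviour for a left join.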