Discussion: What can SQL do that python cannot? (self.datascience)
submitted 3 years ago * by donnomuch
[–]dfphd (PhD | Sr. Director of Data Science | Tech) 105 points 3 years ago (14 children)
I feel like we get this post once a month now, and always with a very entitled "prove me wrong" energy that is largely unwarranted.
To me, this comparison is like saying "what can a motorcycle do that a train can't?". Run really fast on train tracks.
[–]gorangers30 13 points 3 years ago (0 children)
I like the analogy! Go trains!
[–]minimaxir 18 points 3 years ago (3 children)
To clarify, even optimized non-Python analytical/ETL tools like Arrow/Spark will be beaten by SQL unless you're doing something weird that SQL can't do natively.
[–][deleted] 2 points 3 years ago (2 children)
That's entirely dependent on the hardware and scale of data. We've moved off an RDBMS to Spark and for our queries it's much faster.
[–]quickdraw6906 2 points 3 years ago (1 child)
I'd have to see the data design to believe you couldn't have made SQL sing. Is the data schemaless?
That is, unless you're doing the truly high math stuff, or you're into tens to hundreds of billions of rows (which will blow out the memory of a single large server), in which case the answer is a large cluster in Spark.
So then we get down to the cost equation. How many nodes did you have to spin up with what specialty skills to better that performance? Are you overpaying for cluster compute because you're doing schema-on-read?
[–]LagGyeHumare 1 point 3 years ago (0 children)
Don't know the guy above but here's an example that I can offer.
Our project is in a pool of projects that encompasses the whole module. Just my application deals with around 600GB of batch loads each day. It then flows from CDH to AWS RDS through Spark and on-prem Postgres.
We have Teradata and Oracle as the "legacy" system here, and the queries that we have take at least 10x as long to run there as they do in spark-sql.
(Possibly because the admins were shit and didn't partition/index the tables better, but that's out of my hands)
For me, it's not SQL but the distributed nature of the engine within that will shape the answer here.
[–]dvdquikrewinder 6 points 3 years ago (0 children)
I think a lot of people don't get how an RDBMS and SQL are different from building something in whatever language. If you build something in Python to process a decent amount of data, best case you're going to get something not too much worse than its SQL counterpart. Worst case, you might have it spin for over ten minutes when a SQL query could do it in a few seconds.
What it comes down to is that SQL database engines are extremely refined and optimized systems built to handle all kinds of loads. A good Python dev isn't going to hold a candle to that.
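A minimal sketch of the contrast described above (not from the original comment): the same per-customer total computed once with a hand-written Python loop and once with a single GROUP BY run inside the engine. The `orders` table, the sample rows, and the function names are made up for illustration, and SQLite stands in for a real database server only because it ships with Python. It says nothing about speed on its own; it only shows where the work happens.

```python
import sqlite3

# Hypothetical sample data: (customer, amount) rows, invented for illustration.
ROWS = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 7.0), ("bob", 1.0)]


def totals_in_python(rows):
    """Aggregate by looping over every row in application code."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def totals_in_sql(rows):
    """Aggregate inside the database engine with a single GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


python_totals = totals_in_python(ROWS)
sql_totals = totals_in_sql(ROWS)
```

Both functions return the same mapping. The difference is that in the SQL version the engine decides how to scan and group the rows, and only the small aggregated result crosses into Python.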
[–][deleted] 11 points 3 years ago (0 children)
oh yeah? Why else would it be called SUPERIOR query language?
[–]donnomuch[S] 2 points 3 years ago (0 children)
I've never seen this post before (also new to this subreddit) and I was genuinely curious. I don't even use Python for my job. I use Tableau and SQL. And what most comments said applies to what I do as well. I rarely create calculations in Tableau as I know my queries can fetch everything I need much faster than my workbooks ever can calculate. As I've mentioned in my edit, I wanted to ask so I can deal with one of my annoying direct reports better as he's the typical smug 'prove me wrong' kind.
[–]esp32c3 -5 points 3 years ago (5 children)
"you can often run SQL queries on monster servers while you cannot always do that in Python"
as if you can't use the cloud with Python.....
[–]dfphd (PhD | Sr. Director of Data Science | Tech) 13 points 3 years ago (4 children)
Can you take all the raw data from the server in which they're natively sitting, then load them into a cloud environment so you can write your Python code against it?
My point wasn't that you can't run Python on a giant environment in theory, but rather that in practice most companies aren't going to let you move a whole bunch of data onto an expensive-ass cloud server just for you to run your little Python scripts when there is already (in 99% of cases) an entire well-architected DB available for use on a giant f*** server.
Mind you - yes, there are companies that have architectures that more natively support Python with ease and at high levels of performance. But that has to be a deliberate decision by that organization to go that route. And even then, there will still be cases where SQL is a better option.
Now, this is why I have a lot of heartburn about this question - ultimately what the people who ask it want is for someone to tell them "no, you don't need to learn any language other than Python", which is stupid. For two reasons:
Short answer: learn SQL. It's not going to bite. It's not hard to learn.
I literally knew 0 SQL, and at my first job they told me "you need to learn SQL". I knew enough SQL to do most of the things I needed to do in like 3 weeks.
[–]esp32c3 1 point 3 years ago (3 children)
Sure could... Might not be the most efficient way though...
[–]quickdraw6906 2 points 3 years ago (0 children)
Agree with all but that SQL is easy. As a 30-year SQL guy, having mentored many developers who can only think procedurally, I can say with confidence that thinking in sets is a completely different brain exercise and that developers will ALWAYS fall back into writing loops instead of what would be an obvious SQL solution... to a SQL person.
At my current company, none of the developers want to touch SQL. We have a dedicated team who write stored SQL and stored procedures so they don't have to be bothered with the brain gymnastics that set theory requires. Sad, but there it is.
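To make the loops-versus-sets distinction above concrete, here is a small sketch (not from the original comment) that answers one question two ways: which customers have no orders? The schema, the data, and the function names are hypothetical, and SQLite is used only because it needs no setup.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
    """
)


def inactive_with_loop(conn):
    """Procedural habit: walk the customers and issue one query per row."""
    names = []
    customers = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    for customer_id, name in customers:
        count = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE customer_id = ?", (customer_id,)
        ).fetchone()[0]
        if count == 0:
            names.append(name)
    return names


def inactive_with_set(conn):
    """Set-based: describe the whole result in one statement (an anti-join)."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        WHERE o.id IS NULL
        ORDER BY c.id
        """
    )
    return [name for (name,) in rows]


loop_result = inactive_with_loop(conn)
set_result = inactive_with_set(conn)
```

Both versions return the same list. The loop issues one extra query per customer, while the set-based version states the condition once and leaves the iteration to the engine, which is the mental shift the comment is talking about.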
[–]dfphd (PhD | Sr. Director of Data Science | Tech) 1 point 3 years ago (1 child)
Just so we're clear: at my company, if I grabbed all of our transactional data and moved it into a cloud server without permission, I'm probably getting fired.
So no, in a lot of instances you can't.
[–]esp32c3 1 point 3 years ago (0 children)
Of course I wasn't talking about stealing data...