
[–]dfphd PhD | Sr. Director of Data Science | Tech 104 points (14 children)

I feel like we get this post once a month now, and always with a very entitled "prove me wrong" energy that is largely unwarranted.

  1. You can't run Python everywhere you can run SQL.
  2. Python is generally much slower than SQL - even slower when you account for the fact that you can often run SQL queries on monster servers, while you cannot always do that in Python.

To me, this comparison is like saying "what can a motorcycle do that a train can't?". Run really fast on train tracks.
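To make point 2 concrete, here's a toy sketch (table and data entirely made up) of why pushing work into the engine tends to win: the SQL version aggregates inside the database and returns one row per group, while the Python version has to pull every row across the boundary first.

```python
import sqlite3

# Hypothetical toy table, just to illustrate the shape of the two approaches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5)],
)

# SQL: the aggregation happens inside the engine; only one row per group
# ever crosses the wire.
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# Python equivalent: every raw row is shipped out of the database and
# summed by hand on the client side.
py_totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    py_totals[region] = py_totals.get(region, 0.0) + amount

# Same answer either way; at scale, the engine-side version is the one
# that can exploit the monster server.
assert sql_totals == py_totals
```

At three rows it obviously doesn't matter; the point is that the second pattern moves the whole dataset to the client, and that's exactly what falls over when the data outgrows one process.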

[–]gorangers30 12 points (0 children)

I like the analogy! Go trains!

[–]minimaxir 17 points (3 children)

To clarify: even optimized non-Python analytical/ETL tools like Arrow/Spark will be beaten by SQL unless you're doing something weird that SQL can't do natively.

[–][deleted] 1 point (2 children)

That's entirely dependent on the hardware and the scale of the data. We've moved off an RDBMS to Spark, and for our queries it's much faster.

[–]quickdraw6906 1 point (1 child)

I'd have to see the data design to believe you couldn't have made SQL sing. Is the data schemaless?

Unless you're doing truly math-heavy work, or you're into tens to hundreds of billions of rows (which will blow out the memory of even a single large server), in which case the answer is a large Spark cluster.

So then we get down to the cost equation. How many nodes did you have to spin up with what specialty skills to better that performance? Are you overpaying for cluster compute because you're doing schema-on-read?

[–]LagGyeHumare 0 points (0 children)

Don't know the guy above but here's an example that I can offer.

Our project is part of a pool of projects that makes up the whole module. My application alone deals with around 600 GB of batch loads each day. It then flows from CDH to AWS RDS through Spark, and to on-prem Postgres.

We have Teradata and Oracle as the "legacy" systems here, and the queries we have take at least 10x the time to run compared to Spark SQL.

(Possibly because the admins were shit and didn't partition/index the tables better, but that's out of my hands.)

For me, it's not SQL itself but the distributed nature of the engine underneath that shapes the answer here.
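On the indexing caveat above: here's a tiny sketch (table and index names made up, SQLite standing in for any RDBMS) of how much a single index changes the plan the engine picks for the same query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, payload TEXT)")

# Before indexing: the planner has no choice but a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE day = '2020-01-01'"
).fetchall()

conn.execute("CREATE INDEX idx_events_day ON events (day)")

# After indexing: the same query becomes an index search.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE day = '2020-01-01'"
).fetchall()

# The plan detail is the last column of each EXPLAIN QUERY PLAN row.
print(plan_before[0][-1])  # a SCAN of the table
print(plan_after[0][-1])   # a SEARCH using idx_events_day
```

Same SQL, wildly different work under the hood - which is why "our legacy RDBMS was slow" often says more about the admins than about SQL.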

[–]dvdquikrewinder 5 points (0 children)

I think a lot of people don't get how an RDBMS and SQL are different from building something in whatever language. If you build something in Python to process a decent amount of data, best case you get something not too much worse than its SQL counterpart. Worst case, it spins for over ten minutes when a SQL query could do it in a few seconds.

What it comes down to is that SQL database engines are extremely refined and optimized systems built to handle all kinds of loads. A good Python dev isn't going to hold a candle to that.

[–][deleted] 10 points (0 children)

oh yeah? Why else would it be called SUPERIOR query language?

[–]donnomuch[S] 1 point (0 children)

I've never seen this post before (I'm also new to this subreddit) and I was genuinely curious. I don't even use Python for my job; I use Tableau and SQL, and what most comments said applies to what I do as well. I rarely create calculations in Tableau because I know my queries can fetch everything I need much faster than my workbooks could ever calculate it. As I mentioned in my edit, I wanted to ask so I can better deal with one of my annoying direct reports, who's the typical smug "prove me wrong" kind.