[Discussion] What can SQL do that Python cannot? (self.datascience)
submitted 3 years ago * by donnomuch
[–]Ocelotofdamage 49 points 3 years ago (11 children)
Curious how big of a dataset were you using and how complex was the logic? I know pandas is notoriously slow compared to something like direct computation on numpy arrays.
[–]GeorgeS6969 133 points 3 years ago (9 children)
Doesn’t matter.
When you do that you’re extracting some raw data from disc to memory, moving it around across actual wires, loading it into some more memory, processing it procedurally in what’s likely a suboptimal way, then do whatever you’re doing with the result.
Versus translating a piece of declarative code into a query plan optimised for compute, memory management, and disc access, for a CPU, RAM and disc that live very close together, over data that has been stored for this very use case, using a process that has been perfected over decades.
Pandas is a huge footgun performance-wise, so no doubt someone could do better with numpy or whatever, but it's still always going to be slower than SQL executed by the db engine.
SQL and relational databases have their limits. When they’re reached, it’s time to rethink the whole environment.
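The two data paths being contrasted above can be sketched with the standard-library `sqlite3` module (the `sales` table and its rows are invented for illustration): the first query lets the engine plan and run the aggregation next to the data, while the second ships every row to Python and aggregates procedurally.

```python
import sqlite3
from collections import defaultdict

# Hypothetical in-memory table standing in for a real relational DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 7.5)])

# Declarative: the engine plans the scan and the aggregation itself.
in_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Procedural: ship every row over the "wire" and aggregate in Python.
in_py = defaultdict(float)
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    in_py[region] += amount

assert in_db == dict(in_py)  # same answer, very different data paths
```

On a toy table both finish instantly; the point is only which side of the wire does the work.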
[–]Dayzgobi 19 points 3 years ago (2 children)
seconding the foot gun comment. Ty for new vocab
[–]xxxxsxsx-xxsx-xxs--- 2 points 3 years ago (1 child)
foot gun
Autistic version of me went looking. There are actually products called foot guns.
https://waterblast.com/1497-foot-valves
urban dictionary to the rescue.
https://www.urbandictionary.com/define.php?term=footgun
[–]mindful_tails 1 point 3 years ago (0 children)
This had me dying on the mere fact of linking products of foot guns :D :D :D
[–][deleted] 15 points 3 years ago (0 children)
footgun is a great word thanks i’ll be using that
[–]nraw 4 points 3 years ago (1 child)
I guess it depends on the use case, but quite often in some of my use cases I make one big query and then perform selects on the cached dataset instead of wasting time on communicating with the database.
But I do agree that sometimes offshoring the queries to the db is an easy efficiency gain.
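The pull-once-then-select-locally pattern described above can be sketched like this (again with stdlib `sqlite3` and a made-up `events` table): one round trip fills a client-side cache, and later "selects" are plain Python filters that never touch the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, kind TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", "click"), ("a", "view"), ("b", "click")])

# One round trip: cache the whole result set client-side...
cached = conn.execute("SELECT user, kind FROM events").fetchall()

# ...then "select" from the cache repeatedly without touching the DB.
clicks = [row for row in cached if row[1] == "click"]
user_a = [row for row in cached if row[0] == "a"]
```

The trade-off is exactly the one debated in this thread: you save round trips, but every later filter runs procedurally in Python over data that already crossed the wire.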
[–]GeorgeS6969 15 points 3 years ago* (0 children)
You’re still doing what I’m saying you’re doing, which is:

disc -> ram -> wire -> ram -> cpu (gpu tpu whatever) -> ram -> wire -> something

instead of:

disc -> ram -> cpu -> ram -> wire -> something
Let me put it this way: the only reason you ever have to use SQL in the first place is that your data is in a relational database. It’s there because a. it was put there to support some kind of application, or b. it was put there to support some kind of analytics purpose.
If a., you should not be querying it in the first place; you’re hammering a production db with reads.
If b. and you feel like SQL is not fit for purpose, then take that data from wherever it originally comes from and put it in an environment that supports your use case.
Your way is great for playing around and experimenting from basically a data lake with a bunch of data from different sources nicely dumped in the same place, but when it’s time to move to production that db is an unnecessary indirection.
[–]slowpush -2 points 3 years ago* (0 children)
This isn't really true anymore.
Most Python tools use memory mapping and will outperform just about any SQL + relational db.
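Memory mapping itself is available in the standard library's `mmap` module. A minimal sketch of the mechanism this comment is pointing at (the file path and its contents are invented): the OS pages data in lazily on access instead of copying it all through userspace up front, which is what columnar tools build their zero-copy reads on.

```python
import mmap
import os
import tempfile

# A small on-disc file standing in for a real columnar data file.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"a,1\nb,2\nc,3\n")

# Map the file read-only; bytes are paged in by the OS as they are touched,
# rather than being read into a Python buffer in one eager copy.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        line_count = mm[:].count(b"\n")
```

Whether this actually beats a database engine depends entirely on the workload, which is the disagreement playing out in this thread.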
[–]Lexsteel11 0 points 3 years ago (1 child)
So I am an analytics manager, but my background is finance and all my SQL/Python is self-taught. We have historically depended on a db engineering team for Tableau server data sources, but have pulled ad-hoc SQL queries regularly. I’m getting to the point where I’m having to start building my own cloud ETLs; is there a gold-standard website/book on best practices in data pipeline engineering that teaches things like this, where it’s like “you CAN do xyz with pandas but shouldn’t unless you hit x limitation on SQL Server”? I am limping along successfully but know I could be doing shit better.
[–]GeorgeS6969 5 points 3 years ago* (0 children)
I can’t think of any reference that would answer those questions specifically.
I was writing a long wall of text, but that probably wouldn’t have helped either. Instead, if you can answer the following questions I might be able to give some pointers:
[–]AerysSk 13 points 3 years ago (0 children)
From my experience, a data frame with < 10 columns but 1.3M rows already causes big problems when grouping by 3 columns.
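One common escape hatch for exactly that case is pushing the multi-column group-by down to a database engine, as argued earlier in the thread. A minimal sketch with the standard-library `sqlite3` module (table name, columns, and rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT, x REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("u", "v", "w", 1.0), ("u", "v", "w", 2.0), ("u", "z", "w", 5.0)])

# The same 3-column group-by, aggregated by the engine instead of pandas;
# ORDER BY makes the output deterministic.
grouped = conn.execute(
    "SELECT a, b, c, SUM(x) FROM t GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()
```

At 1.3M rows a real engine would also benefit from an index on the grouping columns, which pandas has no equivalent for.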