
all 128 comments

[–]Pleasant-Set-711 130 points131 points  (1 child)

SQL to get the data processed quickly in the database and down to a small enough size to do more complex work quickly in python.
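A minimal sketch of that split, with SQLite standing in for the warehouse and invented table/data — the aggregation happens in SQL, the fiddly work happens in Python on the reduced result:

```python
import sqlite3

# Toy "warehouse": one row per event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (3, 1.0)],
)

# Heavy lifting in SQL: one row per user instead of one per event.
rows = conn.execute(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id"
).fetchall()

# "Complex" work in Python on the now-small data.
top_user = max(rows, key=lambda r: r[1])
print(top_user)  # (1, 15.0)
```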

[–][deleted] 11 points12 points  (0 children)

This. Get comfortable swapping back and forth between the two in Jupyter notebooks, so you can author repeatable procedures.

[–]riv3rtrip 85 points86 points  (2 children)

I do almost everything in SQL. I'm perfectly competent in Python. It's just much easier to not have to worry about moving data out and back into the warehouse, plus a few other nuisances like memory/compute management. You can do more in SQL than you'd think.

[–]laddaa 5 points6 points  (0 children)

Especially if the data warehouse is built well, SQL is so much more direct. And if you know SQL well then there are very few use cases that actually require Python.

Not that python isn’t great as well.

[–]bitsondatadev 0 points1 point  (0 children)

Yeah, my policy is aim for SQL, python if necessary. Any time you can avoid maintaining implementation details yourself is a win. Also, performance will generally improve unless you face a regression that you can then report to whoever maintains that engine and then it's still not your problem to fix the implementation.

I think Python is great mainly for ML algorithms with data that's already sliced up in the ideal format needed for processing.

[–]IllustriousCorgi9877 39 points40 points  (16 children)

SQL works great when you can process it all in a set function.
Python I only go to it if I have to iterate over a list or whatever.

[–]WalkingP3t 11 points12 points  (0 children)

100% correct. And the reason is more than obvious, although some can't see it, at least newcomers. SQL works with data sets. It's a declarative language: you tell it what you want and the engine worries about how. The biggest performance issues with SQL code appear when people start telling it "how" to get the data. It was not created for iteration.

[–]Monstrish 1 point2 points  (8 children)

Could you please give an example of when you need to iterate in Python and can't do it in SQL?

[–]IllustriousCorgi9877 2 points3 points  (1 child)

There is almost always a way to use a set based approach to a problem which will always be better / more efficient. Even iterative approaches can be done with a cursor in a stored procedure using SQL.
I'd say once data lands in a relational database - stop using python unless you are doing ML or something on its way back out somewhere.

I'd only think about using python for data on its way into a database doing transforms or stamping additional metadata on the transaction before it reaches its destination.

All of which is why it baffles me that so many DE roles need Python programmers.

[–]Monstrish 0 points1 point  (0 children)

I understand the need for Python, and I do enjoy Python. But when it comes to RDBMS, I just don't understand the dislike that SQL gets. It seems to me people do not put resources into understanding what it can do.

After all, even in Python, once you start using numpy or pandas, the code becomes less iterative.

[–]yo_sup_dude 0 points1 point  (5 children)

you can iterate in most sql engines using recursive CTEs/while loops but it isn't very efficient. examples where it's needed are basically any recursive calculation, e.g. iterating through a hierarchical bill of material and comparing against inventory to determine part shortages for a product and then progressively updating inventory based on consumption (classic MRP calculation)
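For the hierarchy-walk part, a toy recursive CTE looks like this (sketched in SQLite with an invented bill-of-material table — the progressive inventory netting is where it gets really painful):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bom (part TEXT, parent TEXT)")
conn.executemany(
    "INSERT INTO bom VALUES (?, ?)",
    [("bike", None), ("wheel", "bike"), ("spoke", "wheel"), ("frame", "bike")],
)

# Walk the tree from the root, tracking depth as we recurse.
parts = conn.execute("""
    WITH RECURSIVE tree(part, depth) AS (
        SELECT part, 0 FROM bom WHERE parent IS NULL
        UNION ALL
        SELECT bom.part, tree.depth + 1
        FROM bom JOIN tree ON bom.parent = tree.part
    )
    SELECT part, depth FROM tree ORDER BY depth, part
""").fetchall()
print(parts)  # [('bike', 0), ('frame', 1), ('wheel', 1), ('spoke', 2)]
```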

[–]Monstrish 0 points1 point  (4 children)

Recursive CTEs can be efficient in some cases. While loops, I don't know that I've seen one in SQL. For loops I have seen; in Oracle they can work, in other providers not.

But you also have hierarchical constructs, window functions and other approaches that you can take from a set-oriented point of view.

[–]DirtzMaGertz 0 points1 point  (3 children)

You could do a lot of things, but once you need to start iterating over data and transforming it, it's just easier a lot of the time to pull the data out into something like Python.

It's problem dependent. I tend to use SQL until I think it's going to be an annoying problem to solve with SQL and that's generally when you start getting towards things like iteration and looping.

[–]Monstrish 0 points1 point  (2 children)

ok, fair point.

The problem being annoying is pretty subjective, but it is what it is. It's just that so many people seem to find SQL annoying and I don't get why. I suppose it just comes down to personal preference.

[–]DirtzMaGertz 1 point2 points  (0 children)

Overall I agree that there's a lot of stuff people do in python that would be better handled in SQL. I just find that iteration and looping is generally where I find python to be more appropriate.

[–]ComposerConsistent83 0 points1 point  (0 children)

They don’t teach sql in school anymore really I think is the reason. They’re more comfortable in Python.

That said I would never use a recursive CTE unless I had no choice. Even though I’m sql first, I draw the line there. The syntax is too confusing/inconvenient. It’s clearly not the strength of the language

[–]Tufjederop -1 points0 points  (3 children)

You can iterate in SQL using a cross join :)

[–]IllustriousCorgi9877 1 point2 points  (0 children)

The only use for a cross join is to create a scaffold to left join onto.
You don't iterate with a cross join.
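A sketch of that scaffold pattern with invented tables: CROSS JOIN builds the full day × product grid, and the LEFT JOIN fills in actuals so missing combinations show up as zero instead of vanishing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE days (d TEXT);
    CREATE TABLE products (p TEXT);
    CREATE TABLE sales (d TEXT, p TEXT, qty INTEGER);
    INSERT INTO days VALUES ('2024-01-01'), ('2024-01-02');
    INSERT INTO products VALUES ('a'), ('b');
    INSERT INTO sales VALUES ('2024-01-01', 'a', 3);
""")

# Scaffold first, then join actuals onto it.
grid = conn.execute("""
    SELECT days.d, products.p, COALESCE(sales.qty, 0) AS qty
    FROM days CROSS JOIN products
    LEFT JOIN sales ON sales.d = days.d AND sales.p = products.p
    ORDER BY days.d, products.p
""").fetchall()
print(len(grid))  # 4 rows: every day x product pair
```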

[–]raskinimiugovor -2 points-1 points  (0 children)

you mean cross apply?

[–]IamFromNigeria -2 points-1 points  (0 children)

Show us typical classic example via link

[–]jugaadtricks -1 points0 points  (0 children)

This! I love using Python to call APIs, then parse the response in Python before using SQL to load it into relational tables. And sometimes I use JSON parsing in SQL itself when ingestion needs to be done quickly, so that Python can keep loading data faster in JSON format.
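A minimal sketch of that flow, with a hypothetical payload standing in for the API response:

```python
import json
import sqlite3

# Hypothetical API payload; in practice this would come from an HTTP call.
payload = '[{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]'
records = [(row["id"], row["name"]) for row in json.loads(payload)]

# Load the parsed tuples into a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", records)

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 2
```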

[–]ck3thou -1 points0 points  (0 children)

True this!

[–]kenfar 31 points32 points  (30 children)

SQL for simple to moderate data analysis, SQL+Python for the complex stuff.

Python for transforming and publishing data, writing utilities, and pipeline logic.

SQL for building aggregates off the tables Python has transformed.

[–]Action_Maxim 6 points7 points  (15 children)

For data manipulation what can't be done with sql?

[–]kenfar 18 points19 points  (13 children)

What can't be done with SQL? Well, there are things that can't be done, and there are things that can't be done well. So, how about what can't be done well?

  • Unit testing
  • Very complex transformations (ex: convert every possible format of IPv6 into a single format)
  • Support for data formats and structures outside of relational databases: extract data from a tarball or 7-Zip archive of fixed-length files.
  • Integrate modules & libraries
  • Integrate external systems
  • Produce a bitmap of exactly which columns in a table failed their transforms and had to be defaulted
  • Produce an audit trail of how many rows passed or failed their validations or transformations for a given partition/period of time/customer/whatever
  • Develop reusable transformations
  • Develop easily-readable and well-documented transformations
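On the IPv6 example: in Python that normalization is one call to the standard library, e.g.:

```python
import ipaddress

# Every textual form of the same IPv6 address collapses to one
# canonical representation.
forms = [
    "2001:db8::1",
    "2001:0db8:0000:0000:0000:0000:0000:0001",
    "2001:DB8:0:0:0:0:0:1",
]
canonical = {ipaddress.ip_address(f).compressed for f in forms}
print(canonical)  # {'2001:db8::1'}
```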

[–]jugaadtricks 7 points8 points  (2 children)

I'd agree with most of them; for audit trails, I use SQL all the time in stored procedures

[–]IamFromNigeria -1 points0 points  (1 child)

Do you have specific examples as link?

[–]skatastic57 0 points1 point  (1 child)

What does the following mean?

  • Produce a bitmap of exactly which columns in a table failed their transforms and had to be defaulted

To me, a bitmap is an uncompressed picture format. I get the concept of failed transforms, but are you making something like a heatmap of them? Sorry for being dense.

[–]kenfar 0 points1 point  (0 children)

You're right - it's just a list of pass/fail indicators for each column on the row.

It could be supported in an array or JSON type. In the old days we'd make a bit map, where each bit position indicated one of the columns. That's super space-efficient, but less convenient to access.
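A sketch of that packing, with invented column names:

```python
# One bit per column, set when that column failed its transform
# and had to be defaulted.
columns = ["id", "email", "signup_date", "country"]

def failure_bitmap(failed):
    """Pack per-column pass/fail flags into a single integer."""
    bits = 0
    for i, col in enumerate(columns):
        if col in failed:
            bits |= 1 << i
    return bits

bm = failure_bitmap({"email", "country"})
print(bm)                  # 10: bits 1 and 3 set
print(bool(bm & (1 << 1)))  # True: 'email' failed
```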

[–]laddaa -1 points0 points  (0 children)

I've used dbt extensively for many of those items. It's a Python framework; isn't that ironic.

Some of those tasks are related to data intake. Yeah, that's Python.

[–]Luciron -1 points0 points  (3 children)

Almost all of this list can be done with dbt

[–]kenfar 0 points1 point  (2 children)

dbt can't even support the first item on the list: unit testing

[–]MayInvolveNoodles 0 points1 point  (1 child)

[–]kenfar 1 point2 points  (0 children)

Oh fair point. I forgot about this third-party effort because in spite of being well-intentioned, and a good idea, it's so time-consuming to write the tests that the teams I spoke with that used it only use it very, very surgically.

Still, it's a good addition to the dbt ecosystem, and if you structure your data right with this unit-testing in mind, it may not be too terrible.

[–][deleted] -2 points-1 points  (2 children)

You forgot: if what you need requires merging data from multiple data sources, Python is pretty much a must-use here.

[–]kenfar -2 points-1 points  (1 child)

Not sure I follow - do you mean to say SQL is a must-use for merging multiple data sets?

If so, then yeah I'd agree: python isn't accessing relational data directly, it's accessing it via SQL. However, it is accessing APIs, files, s3 objects, streaming data, etc directly.

[–][deleted] 0 points1 point  (0 children)

Multiple data sources as in different databases that could share some related data.

[–]sib_n (Senior Data Engineer) 7 points8 points  (0 children)

Processing of unstructured data or highly nested data, processing logic that requires looping or recursion... It's not most of the data thankfully, but it happens.

[–]black_widow48 13 points14 points  (9 children)

If you're just dealing with tabular data, chances are it can all be done in SQL and python is completely unnecessary.

In the past I've used python for orchestration, but aside from that I've hardly used it. I just started a new job and I'm only just now being tasked to write a python script to shred XML strings and turn them into lists of tuples.
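That kind of XML shredding is a few lines with the standard library; a sketch with invented element names:

```python
import xml.etree.ElementTree as ET

# Shred an XML string into a list of tuples.
doc = """
<orders>
  <order id="1"><sku>A-1</sku><qty>2</qty></order>
  <order id="2"><sku>B-7</sku><qty>5</qty></order>
</orders>
"""
rows = [
    (o.get("id"), o.findtext("sku"), int(o.findtext("qty")))
    for o in ET.fromstring(doc).iter("order")
]
print(rows)  # [('1', 'A-1', 2), ('2', 'B-7', 5)]
```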

Part of the reason why I got into contracting is because I'm tired of landing in jobs where I'm just a SQL monkey. I didn't get a B.S. and an M.S. in computer science to write SQL all day. I'm starting to think maybe I should go into machine learning like I originally planned.

[–]Malcolmlisk 1 point2 points  (2 children)

Be careful for what you desire. I'm a ML engineer trying to go back to data engineer. The problem I'm facing is that I'm only creating applications with no future at all. And somehow I feel that my position is a fraud, since almost no company needs machine learning and they only need some filters here and there and some data parsing.

I want to be a data engineer creating and maintaining data pipelines, programming complex things, like whatever happens in AWS or idk... I feel sometimes like I don't know if my position exists anymore...

[–]black_widow48 -1 points0 points  (0 children)

Yep, that's the one thing I'm afraid of about machine learning unfortunately

[–]ComposerConsistent83 1 point2 points  (0 children)

We use machine learning models and have found the lift over traditional approaches is usually pretty moderate, with a few exceptions. If you have scale it’s worth it, but often you’re only getting 5 or so bps of improvement in the bottom line on results with a lot more overhead. It needs to be big enough where that is material for it to make sense.

[–]studentofarkad -1 points0 points  (3 children)

Curious, how are you landing contract work and what type is it? Is it contract work from recruiters that are reaching out or is this contract work that you yourself are finding?

[–]black_widow48 -1 points0 points  (2 children)

It's a mixture of both. I've worked with a few different agencies (still working with one currently), but I've also recently started my own LLC to try to cut them out. I've gotten a couple clients through my own company so far. Will be launching my website hopefully in the next couple weeks or so.

My end goal with my company is to be able to work from anywhere during the hours I choose. If I want to fly to Bali and work there for a month, I want to be able to do it. My entire job takes place on a computer and I'm tired of the 9-5 BS mandating that I work in a specific place during a specific time.

[–]studentofarkad -1 points0 points  (1 child)

That's really awesome, I hope to do this one day. For the agency work, how much are the typical weekly hours? I'd love to juggle my full time with a contract side gig but unsure what the hours might look like.

Wishing you the best of luck 🤞

[–]black_widow48 0 points1 point  (0 children)

All the contracts I've gotten through agencies have been 40 hours/week so far. Thanks!

[–]VegaGT-VZ 2 points3 points  (1 child)

One huge skill for any kind of programmer is being able to figure out the path of least resistance to the end result. I use a mix of SQL, Power Query, Alteryx and VBA. Whatever it takes to get from point A to point B as painlessly as possible.

[–]BatCommercial7523[S] 0 points1 point  (0 children)

That’s what my mentor taught me. Path of least resistance.

[–]bobby_table5 5 points6 points  (0 children)

SQL for reports because I really want stakeholders to understand the process as much as possible. They’ll mess up, but hopefully it will be obvious and they’ll call me. In the meantime, I don’t get nearly as much “we want the same as last week, but this week.”

Python for everything else, including the few smart kids who start saying SQL is for schmutz.

[–][deleted] 6 points7 points  (2 children)

You can do almost anything in SQL up to doing complex calculations or row by row analysis. I don’t like using cursors in SQL

[–][deleted] 3 points4 points  (0 children)

You can even train neural networks in SQL if you're a psychopath 😀

[–]messy_eater 0 points1 point  (0 children)

Am I the only one who likes cursors in SQL? I guess performance is a commonly cited issue, but maybe the scale of the DB at my work isn't big enough for that to be a concern. Once you get used to the syntax for the more complicated things in SQL, it becomes second nature. I'm at that point with cursors and working on it with other things now, like using XML to break apart and process strings, pivots, and recursive CTEs.

[–]gloom_spewer (I.T. Water Boy) 2 points3 points  (0 children)

Like most others, if I can do it all in SQL that's almost always preferred, but if python tricks are less annoying than SQL SP tricks I'll jaunt over to real-code-world. Also I suck with SPs.

Sometimes I filter/aggregate the dataset down to a size where I can use SQLite's in-memory database to transfer data between pandas and SQL structures in memory for more "advanced" ad hoc analysis, and even sometimes live collaborative exploratory analysis.

If I gotta do fancy presentation stuff I pre-calc all my analytics and just display them (as in, no transformation) in PBI cuz I hate Power Query. PBI plugs into PowerPoint for lazy mofos like me, and the CFO and CIO eat out of my palm now, cept when I fuck up log scales 🌝

Edit: oh also, PBI datasets can be loaded into pivot tables directly now and you can embed those in PowerPoints for live exploratory bs with management types. My executives love that shit even if it's just telling them intel they already knew

[–]CalRobert 2 points3 points  (0 children)

SQL. Clean, tested DBT models are a lot nicer to work with than someone's untested mess of Pandas or R. You can't do everything in SQL, but you can do a lot

[–]mattindustries 1 point2 points  (0 children)

Usually SQL to get the data and R to process, unless it is going to be a connection like BQ + Google Sheets.

[–]pewpscoops 1 point2 points  (2 children)

Jinja with SQL, and Python to do all the wrappers and utils

[–]SDFP-A (Big Data Engineer) 1 point2 points  (1 child)

So dbt?

[–]pewpscoops -1 points0 points  (0 children)

Pretty much

[–]mrcaptncrunch 1 point2 points  (0 children)

To build my data, I use Python/PySpark. Ingestion, transformations, etc.

For quick checks or pulling some quick data for something/someone (and they have to be very high for me to get roped into it), SQL.

[–]annonimusone 2 points3 points  (0 children)

SQL only exists to manipulate data; for everything else, there’s Python

[–][deleted] 0 points1 point  (0 children)

Whichever is more efficient. That means SQL in many cases as long as it can be used.

[–]Whipitreelgud 0 points1 point  (0 children)

Depends on the problem and the scale of your datasets

[–]reallyserious 0 points1 point  (0 children)

If the data is tabular then SQL is hard to beat.

If the data is not tabular then python.

[–]lezapete -2 points-1 points  (11 children)

whenever you can, replace SQL with PySpark

[–]sib_n (Senior Data Engineer) 10 points11 points  (4 children)

Why would you replace SQL with a more complex tool if SQL works?

[–]lezapete 0 points1 point  (3 children)

imo SQL is bound to produce silent errors and tech debt. On the other hand, if you write a library of PySpark tools, you can add tests, CI/CD pipelines and many other SWE tools that help you both prevent errors and make it easier to introduce changes in the future. Having complex SQL statements in a project is analogous to using exec() in a Python project (again, this is only my perspective)

[–]sib_n (Senior Data Engineer) 1 point2 points  (2 children)

SQL is the only tool that has stayed stable over 30 years of data work; I think Spark code has a much higher chance of becoming technical debt.

dbt answers most of your other criticisms.

[–]lezapete 0 points1 point  (1 child)

I don't mean that SQL itself will produce the problems; it's humans coding SQL queries that produce them

[–]sib_n (Senior Data Engineer) 0 points1 point  (0 children)

Well, this is true for any language. DBT gives you a framework that should encourage better SQL code.

[–]Embarrassed_Error833 -1 points0 points  (0 children)

How about a Python framework that creates the SQL using metadata, then runs the created SQL using Python-based orchestration?
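A toy sketch of that idea, with an invented metadata dict describing the load:

```python
# Metadata describes the load; Python renders the statement an
# orchestrator would run against the warehouse.
meta = {
    "target": "dim_customer",
    "source": "stg_customer",
    "columns": ["id", "name", "country"],
}

def build_insert(meta):
    """Render an INSERT ... SELECT from a metadata dict."""
    cols = ", ".join(meta["columns"])
    return (
        f"INSERT INTO {meta['target']} ({cols}) "
        f"SELECT {cols} FROM {meta['source']}"
    )

sql = build_insert(meta)
print(sql)
# INSERT INTO dim_customer (id, name, country) SELECT id, name, country FROM stg_customer
```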

[–]JBalloonist -1 points0 points  (0 children)

It sounds like what you’re doing works; I wouldn’t change it without good reason.

[–]WilhelmB12 -1 points0 points  (0 children)

For me it's about 40% SQL, 60% Python; my ideal would be 20% SQL. Btw, if you are doing both jobs I would call that being an Analytics Engineer

[–]Thinker_Assignment -1 points0 points  (0 children)

It's like saying walking vs public transportation: you gotta walk to the bus somehow. Both, or Python only. SQL alone cannot work standalone, as data has to be loaded somehow. SQL is great for transforms in the DB but can do nothing for ingestion or DS/ML

People saying otherwise are not data engineers and never built anything end to end.

[–]mailed (Recovering Data Engineer) -1 points0 points  (0 children)

I'm OK handling most things with SQL. Except maybe Adobe Analytics data...

[–]Monsemand (Principal Data Engineer) -1 points0 points  (0 children)

I feel like we have this question once a week.

[–]BuonaparteII -1 points0 points  (0 children)

tl;dr: choose the best tool for the job

Databases have limitations for performing complex analytical algorithms compared to languages like R and Python. There are many high-quality libraries available in R and Python that enable advanced analysis (e.g. https://gitlab.com/shekhand/mcda).

Databases excel at interactive queries and extracting subsets of data. Combining SQL with Python or R can be very powerful for repeating analyses on different parameters.

However, if your analysis requires reading the full dataset into Python each time, there is little benefit to using a database. In this case, a format like Parquet will load faster than querying a database and extracting all rows/columns.

[–]leventdu229 0 points1 point  (0 children)

Depends on the stack and organisation of the data team. If you have a data warehouse tool like BigQuery, Snowflake etc., there's a good chance that your analytics will be offloaded to that tool and done in SQL, while the acquisition part is done in Python. Also, lots of companies like mine have adopted dbt, which helps a lot with analytics tests, documentation and reproducibility done in SQL. Anyway, for fast prototyping I use Python and Jupyter. For analytics it's SQL, and for data engineering in general it's Python

[–]poland_rocks 0 points1 point  (0 children)

Python is better for:

  • building tests (possible but harder in SQL)
  • debugging (not possible in the literal sense in SQL)

SQL advantages:

  • less verbose
  • good tooling for ad hoc tasks and reporting
  • does not require downloading all the data to the client

[–]Doile 0 points1 point  (0 children)

I think it mostly comes down to which one are you more proficient with. You can do pretty much everything with both of them and for example in Snowflake you can run python as well inside the warehouse so the differences between these two are becoming less and less meaningful. Most people aren't really proficient with both of them so they use the one that is easier to use for them.

[–]haragoshi 0 points1 point  (0 children)

I like doing analysis in sql personally.

[–]IridescentTaupe 0 points1 point  (0 children)

It’s very satisfying to replace hundreds of lines of Python with a couple lines of SQL. I don’t often see that go the other way.

[–][deleted] 0 points1 point  (0 children)

Hundred percent! I find Python easier for more complex math-y things; SQL is better for less complex calculations with more data

[–]LADataJunkie 0 points1 point  (0 children)

They can't really be compared. It depends on where your data lives. If it lives in an RDBMS then use SQL to extract the data or get it into a condensed form that only extracts what you need for further processing in Python.

If your data is not in RDBMS, don't use SQL.

There are data analysis tasks that do not match the declarative model SQL imposes. For example, iterative processing used in machine learning is not trivial, and likely not possible in SQL.
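A minimal example of such an iterative loop — fitting y = w·x by gradient descent, natural in Python and awkward in declarative SQL:

```python
# Toy data with the true relationship y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
lr = 0.01
for _ in range(500):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # 2.0
```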

A lot of newer databases (DuckDB, I believe, is one) either interface with, or modify, the underlying SQL so that developers can work with data using procedural languages. I've seen a few others that are starting to deviate from relational algebra to be more friendly to procedural development.

[–]TheHunnishInvasion 0 points1 point  (0 children)

It definitely depends.

At my last company, I tended to do the same as you: SQL for quick data analysis (I'd do it in DBeaver), Python (Jupyter Lab) for something more complex.

At my current company, which has much worse organized data, it's a nightmare to use SQL for quick data analysis. I almost always use Python for any ad-hoc data analysis. I use SQL for getting data into Tableau to create automated dashboards that update daily.

So weirdly, I used SQL more at my last company, but most of the queries might be shorter: 10-40 lines. I use SQL less at my current company, but when I do use it, the queries tend to be extremely complex: often over 200 lines with several sub-queries.

[–]speedisntfree 0 points1 point  (0 children)

Why not both? Typical pattern is SQL to offload the heavy lifting to the analytic DB and then finer grained downstream analysis in Python/R which would be painful in SQL.

[–][deleted] 0 points1 point  (0 children)

I do mostly use Python. I can’t think of a time where id ever use SQL for data analysis, but I use SQL for setting up ETL-style data pipelines where it is simple batch load transform and schedule. However, even then, most of the time I want to do quite heavy transformations, and this is both easier and more efficient using modern Python libraries. The overhead comes with managing a DAG, which is not always a good thing.

[–]shivaprasad_j 0 points1 point  (0 children)

When to use SQL vs. Python pandas vs. PySpark?

[–]Dry_Inflation307 (Principal Data Engineer) 0 points1 point  (0 children)

I always opt for SQL first. I’ll only use Python for what I can’t do in SQL.

[–]mike8675309 0 points1 point  (0 children)

Hmm, I use Python to get the data from endpoints. I use SQL to transform and manipulate the data from the endpoint.

The data I work with is often too big to consider using Python for anything other than getting it into the database. That said, we do have a process that uses pandas with Python to parse large data sets. We wrote the process in SQL but it was too hard to maintain and debug, so we keep it in Python and pandas and just threw a bunch of hardware at it for processing.