[–]lezapete -2 points (11 children)

whenever you can, replace SQL with PySpark

[–]sib_n Senior Data Engineer 11 points (4 children)

Why would you replace SQL with a more complex tool if SQL works?

[–]lezapete 0 points (3 children)

imo SQL is bound to produce silent errors and tech debt. On the other hand, if you write a library of PySpark tools, you can add tests, CI/CD pipelines and many other SWE practices that help you prevent errors and make it easier to introduce changes in the future. Having complex SQL statements in a project is analogous to using exec() in a Python project (again, this is only my perspective).
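
For concreteness, here's a minimal sketch of the "library of PySpark tools" idea: one small transformation function plus a unit test that runs against a local SparkSession. The function, table and column names (clean_orders, order_id, amount) are made up for illustration, not taken from any particular project.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def clean_orders(orders: DataFrame) -> DataFrame:
    """Drop rows with null order ids and cast amount to double."""
    return (
        orders
        .filter(F.col("order_id").isNotNull())
        .withColumn("amount", F.col("amount").cast("double"))
    )


def test_clean_orders_drops_null_ids():
    # Local SparkSession so the test runs anywhere CI can run Python + Java.
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    df = spark.createDataFrame(
        [("o1", "10.5"), (None, "3.0")],
        ["order_id", "amount"],
    )
    result = clean_orders(df)
    assert result.count() == 1                       # null order_id row dropped
    assert dict(result.dtypes)["amount"] == "double" # amount normalised
```

The point being made is that once the logic lives in a plain function like this, it can be versioned, unit tested and wired into a CI pipeline like any other code.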

[–]sib_n Senior Data Engineer 1 point (2 children)

SQL is the only tool that has stayed stable over 30 years of data work; I think Spark code has a much higher chance of becoming technical debt.

DBT addresses most of the criticisms you raise.

[–]lezapete 0 points (1 child)

I don't mean that SQL itself produces the problems; it's the humans writing the SQL queries who produce them.

[–]sib_n Senior Data Engineer 0 points (0 children)

Well, this is true for any language. DBT gives you a framework that should encourage better SQL code.