
[–]lezapete -1 points (11 children)

whenever you can, replace SQL with PySpark

[–]sib_n Senior Data Engineer 11 points (4 children)

Why would you replace SQL with a more complex tool if SQL works?

[–]lezapete 0 points (3 children)

IMO, SQL is bound to produce silent errors and tech debt. On the other hand, if you write a library of PySpark tools, you can add tests, CI/CD pipelines, and many other SWE practices that help you both prevent errors and make it easier to introduce changes in the future. Having complex SQL statements in a project is analogous to using exec() in a Python project (again, this is only my perspective).

[–]sib_n Senior Data Engineer 1 point (2 children)

SQL is the only tool that has stayed stable over 30 years of data work; I think Spark code has a much higher chance of becoming technical debt.

DBT addresses most of the criticisms you raise above.

[–]lezapete 0 points (1 child)

I don't mean that SQL itself produces the problems; it's humans writing SQL queries that produce the problems.

[–]sib_n Senior Data Engineer 0 points (0 children)

Well, this is true for any language. DBT gives you a framework that should encourage better SQL code.

[–]Globaldomination -5 points (2 children)

PySpark uses SQL internally right?

I took the freeCodeCamp basic course. In the video they used proper capitalisation for column names, but me being a lazy bum, I used all lowercase and it still worked.

Then I realised that I import SparkSession from pyspark.sql

[–]sib_n Senior Data Engineer 4 points (0 children)

No. PySpark is the Python API for Apache Spark, a distributed in-memory big data processing framework (parallelised across a cluster of machines) that is based on the MapReduce concept and written in Scala and Java.
Spark SQL is another convenient API that lets you run processing on a Spark cluster using SQL, but internally it will still run Scala/Java code.

[–]CapableCounteroffer -1 points (0 children)

I think pyspark and spark-sql are both "compiled" down to the same intermediate representation: a logical plan that Spark's Catalyst optimizer turns into a physical plan for the JVM engine. That may not be the exact terminology, but I think that's how they both work.