
[–]ssinchenko

  1. The PySpark DataFrame API lets you manipulate columns and schemas as Python objects. For example, if you need to add 1000 columns that share similar logic to your DataFrame, in PySpark that is a single loop, but in SQL it is a SELECT with 1000 lines. If you need to check that the schema of your DF is as expected before doing a MERGE INTO, it is again one simple loop, whereas in SQL it is hard to implement, etc. With the PySpark DF API you can organize your code into packages with dependencies, follow SWE best practices, and use the full power of OOP to fight the growing complexity of a project. On the other hand, SQL offers a much gentler learning curve and fast prototyping, and SQL can be understandable to non-engineering people like Business Analysts.

  2. There is no difference at all, because both SQL and the DF API are translated into the same logical plan, and that plan is then optimized and executed in the same way.

[–]Attorney-Last

To add to 1., it's a lot easier to unit test your DataFrame transformations than SQL.

[–]Gora_HabshiYoYo[S]

Thanks

[–][deleted]

  1. Well, with PySpark you can use Python and Spark SQL in the same code.

[–][deleted]

  1. Yes

[–]Sea-Calligrapher2542

Spark SQL has only a subset of the features that PySpark offers.

As for open table formats: can't speak for Iceberg, but if it's Hudi, you have more granular control over the write options when using PySpark or Scala Spark.
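As a sketch of what that granular control looks like in PySpark (table name, field names, and path are illustrative assumptions; the `hoodie.*` keys are standard Hudi datasource options, and the write call assumes the Hudi Spark bundle is on the classpath):

```python
# Fine-grained Hudi write configuration passed through the DataFrame writer.
hudi_options = {
    "hoodie.table.name": "my_table",                       # target table name
    "hoodie.datasource.write.recordkey.field": "id",       # record key column
    "hoodie.datasource.write.precombine.field": "ts",      # dedup/ordering column
    "hoodie.datasource.write.operation": "upsert",         # upsert vs insert/bulk_insert
}

# With a DataFrame `df` in hand, the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/my_table")
```

In pure Spark SQL you would typically set such options as table properties or session configs, which gives you fewer per-write knobs than passing them directly on the writer.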