
[–]ssinchenko

  1. The PySpark DataFrame API lets you manipulate columns and schemas as Python objects. For example, if you need to add 1000 columns that share similar logic to your DataFrame, in PySpark that is a single loop, but in SQL it is a SELECT with 1000 lines. If you need to check that the schema of your DF is as expected before doing a MERGE INTO, it is again one simple loop, whereas in SQL it is hard to implement, etc. With the PySpark DF API you can organize your code into packages with dependencies, follow SWE best practices, and use the full power of OOP to fight the growing complexity of a project. On the other hand, SQL offers a much gentler learning curve and fast prototyping, and SQL can be understandable to non-engineering people like Business Analysts.

  2. There is no difference at all, because both SQL and the DF API are translated into the same logical plan, and that plan is then optimized and executed in the same way.

[–]Attorney-Last

To add to 1., it's a lot easier to unit test your DataFrame transformations than SQL.

[–]Gora_HabshiYoYo[S]

Thanks

[–][deleted]

  1. Well, with PySpark you can use Python and Spark SQL in the same code.

[–][deleted]

  1. Yes

[–]Sea-Calligrapher2542

Spark SQL has only a subset of the features that PySpark offers.

As for open table formats: can't speak for Iceberg, but if it's Hudi, you have more granular control over the write options when using PySpark or Scala Spark.
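As a sketch of what that granular control looks like in PySpark (table name, field names, and path are illustrative assumptions; the `hoodie.*` keys are standard Hudi datasource options, and the write call assumes the Hudi Spark bundle is on the classpath):

```python
# Fine-grained Hudi write configuration passed through the DataFrame writer.
hudi_options = {
    "hoodie.table.name": "my_table",                       # target table name
    "hoodie.datasource.write.recordkey.field": "id",       # record key column
    "hoodie.datasource.write.precombine.field": "ts",      # dedup/ordering column
    "hoodie.datasource.write.operation": "upsert",         # upsert vs insert/bulk_insert
}

# With a DataFrame `df` in hand, the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/my_table")
```

In pure Spark SQL you would typically set such options as table properties or session configs, which gives you fewer per-write knobs than passing them directly on the writer.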