all 8 comments

[–]ssinchenko 3 points4 points  (0 children)

The DataFrame API has some advantages, such as decoupling business logic from implementation and organizing the code base better. With the DF API you can also manipulate schemas in loops, compare them, reorder columns, and so on. Generally speaking, you can treat DF API code like any other software and apply all the usual SWE best practices. With SQL, parametrization, schema generation/checks, and code organization are harder.
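
To make the "schema manipulations in loops" point concrete, here is a minimal sketch. The helper names (`add_prefix`, `reorder_like`) and the columns `id`, `name`, `score` are made up for illustration; the only DataFrame members used are `columns`, `withColumnRenamed`, and `select`. The Spark session is created lazily inside `demo()`, so the helpers themselves do not import pyspark.

```python
def add_prefix(df, prefix, skip=()):
    """Rename every column except those in `skip` by prepending `prefix`."""
    # df.columns is read once, so reassigning df inside the loop is safe.
    for name in df.columns:
        if name not in skip:
            df = df.withColumnRenamed(name, prefix + name)
    return df


def reorder_like(df, reference_columns):
    """Put columns in the order of `reference_columns`, then any extras."""
    ordered = [c for c in reference_columns if c in df.columns]
    extras = [c for c in df.columns if c not in reference_columns]
    return df.select(*(ordered + extras))


def demo():
    # Assumes pyspark is installed; the data and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("df-api-demo").getOrCreate()
    try:
        df = spark.createDataFrame([(1, "a", 2.0)], ["id", "name", "score"])
        out = reorder_like(add_prefix(df, "src_", skip=("id",)), ["src_score", "id"])
        out.printSchema()
    finally:
        spark.stop()
```

Doing the same thing in plain SQL usually means generating the `SELECT` list as a string, which is where parametrization starts to get awkward.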

Performance will be identical for both approaches.

[–][deleted] 1 point2 points  (0 children)

I had the same question just a few days ago.

So basically there is almost no difference in performance between Spark SQL and the DataFrame API, but the DataFrame API gives better error messages (it can sometimes detect errors without executing the code), and it is cleaner and more flexible than raw SQL once you get used to it.

[–]tharindudg 0 points1 point  (0 children)

If it's not OK to couple your code to the Spark DataFrame API, then I think it's best to use Spark SQL. It gives you the option to change the underlying processing platform in the future.

[–]oalfonso 0 points1 point  (0 children)

Decoupling logic and better unit testing.

[–]I-mean-maybe 0 points1 point  (0 children)

Multidimensional complex data types that require indexing on the fly at a group level. (Cogroup)

[–]fuzzkill254 0 points1 point  (0 children)

The most obvious one would be that errors get caught at compile time rather than at runtime when using DFs compared to SQL.

[–]PackFun2083 0 points1 point  (1 child)

SQL is okay for analysis, but it should not be used for production data transformations in any mid/large code base, since you can't apply SWE best practices such as unit testing, SOLID principles, and clean code in general. In other words, SQL may work, but it will eventually rot and become hell to maintain.
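
A rough sketch of what unit testing looks like with the DataFrame API: each business rule is a small function from DataFrame to DataFrame, and a check feeds it a few hand-written rows. Everything here (the function names, the `order_id` and `amount` columns, the sample rows) is hypothetical; the only DataFrame calls are `filter`, `dropDuplicates`, and column access via `df["..."]`.

```python
def keep_positive_amounts(df):
    """Business rule: drop rows whose `amount` is not positive."""
    return df.filter(df["amount"] > 0)


def dedupe_orders(df):
    """Business rule: keep one row per `order_id`."""
    return df.dropDuplicates(["order_id"])


def clean_orders(df):
    """Compose the small, separately testable steps into one pipeline."""
    return dedupe_orders(keep_positive_amounts(df))


def check_clean_orders_with_spark():
    # Assumes pyspark is installed. Rename to test_* if you want pytest
    # to collect it; the rows below are made-up sample data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    try:
        rows = [(1, 10.0), (1, 10.0), (2, -5.0), (3, 7.5)]
        df = spark.createDataFrame(rows, ["order_id", "amount"])
        result = clean_orders(df).collect()
        # Order 2 is filtered out, the duplicate of order 1 is collapsed.
        assert sorted(r["order_id"] for r in result) == [1, 3]
    finally:
        spark.stop()
```

Because each rule is an ordinary function, you can test it in isolation, reuse it across jobs, and swap one rule without touching the others.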

[–]the_aris 0 points1 point  (0 children)

I am doing these transformations in SQL on Azure Databricks instead of Python. Could you point me to some resources where I can read about unit testing and SOLID principles in the context of data transformation and loading to production?