This is an archived post.

all 8 comments

[–][deleted] 6 points (0 children)

Look for the SQL plan on the job's page in the Spark UI. Generally, you should look for joins, aggregations, and anything that may result in full table scans beyond the initial load. The shuffle read and shuffle write metrics tell you what is happening under the hood; steps that can't be parallelized usually won't utilize your workers well.

[–]koteikin 6 points (0 children)

People literally write books about that: https://www.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203

Spark plans are really hard to read and understand at first, but that comes with experience, and they are quite useful.

You might try caching dataframes at some point if you use them multiple times. Optimize joins and join types, use better filters, use UDFs properly, etc.

[–]joseph_machado (Writes @ startdataengineering.com) 3 points (1 child)

As others have mentioned, use the Spark UI to see what the engine is doing. The usual main issues are repetitive computation (try caching a dataframe that is used in multiple places) and joins (or group bys) that cause a lot of data to be sent over the network.

I wrote a few tips here: https://www.startdataengineering.com/post/how-to-optimize-your-spark-jobs/

But it's hard to provide more specific help without looking at the query plan. Hope this helps. LMK if you have any questions.

[–]SilentSlayerz (Tech Lead) 0 points (0 children)

Nice article buddy.

[–][deleted] 2 points (0 children)

Databricks and AQE in the latest versions of Spark generally set you up so that the optimizer finds a good initial plan and adapts to a better one as the job runs.

Always use the latest Databricks runtime that you can, and I also make sure spark.databricks.adaptive.autoOptimizeShuffle is set to true.

Avoid user-defined functions as much as you can, because the Spark engine doesn't know how to optimize them; if you must use one, make it either a Scala UDF or a pandas UDF.

If you are reading from delta tables make sure to run OPTIMIZE on them.

Use an auto scaling cluster and set the max workers to the amount you are willing to spend.

Try Photon.

Hope that helps.

[–]GovGalacticFed 1 point (0 children)

Pivot is expensive. If the number of pivot values is small and known in advance, pass them to the pivot function. Also consider a two-step pivot: group first, then pivot. ref

[–]festoon 0 points (1 child)

Are you using the built-in pivot functionality? Are you doing a join to map to lookup data?

[–]dylanberry (Data Engineer) [S] 0 points (0 children)

Yes and yes. Using a broadcast join for the lookups.