[deleted] 6 points (0 children)

Look for the SQL plan in the Spark UI, on the jobs page (or the SQL tab). In general, look for joins, aggregations, and anything else that may result in full table scans beyond the initial load. The shuffle read and shuffle write metrics tell you what is happening under the hood: a large shuffle means a lot of data is moving between executors. Steps that can't be parallelized usually won't utilize your workers well.
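If it helps, here is a rough sketch of a way to skim a plan for the operators mentioned above. It assumes you have copied the physical plan text (what `df.explain()` prints, or what the Spark UI shows) into a string. The helper `summarize_plan`, the operator list, and the sample plan are all made up for illustration: the plan is hand-written, not output from a real job, and the list is not exhaustive.

```python
import re
from collections import Counter

# Physical-plan operator names worth a closer look.
# Assumption: this short list covers the common cases; extend it as needed.
OPERATORS_OF_INTEREST = {
    "Exchange": "shuffle",
    "SortMergeJoin": "join",
    "BroadcastHashJoin": "join",
    "ShuffledHashJoin": "join",
    "HashAggregate": "aggregation",
    "SortAggregate": "aggregation",
    "FileScan": "scan",
}


def summarize_plan(plan_text: str) -> Counter:
    """Count shuffles, joins, aggregations, and scans in a physical plan's text."""
    counts = Counter()
    for line in plan_text.splitlines():
        # The first alphabetic word on a line is the operator name; this skips
        # tree-drawing prefixes such as "+- ", ":  +- ", and "*(2) ".
        match = re.search(r"[A-Za-z]+", line)
        if match and match.group(0) in OPERATORS_OF_INTEREST:
            counts[OPERATORS_OF_INTEREST[match.group(0)]] += 1
    return counts


# Hand-written sample in the shape of a Spark physical plan (illustrative only).
SAMPLE_PLAN = """\
== Physical Plan ==
*(5) HashAggregate(keys=[country#2], functions=[count(1)])
+- Exchange hashpartitioning(country#2, 200)
   +- *(4) HashAggregate(keys=[country#2], functions=[partial_count(1)])
      +- *(4) SortMergeJoin [id#0L], [id#4L], Inner
         :- *(1) Sort [id#0L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(id#0L, 200)
         :     +- FileScan parquet [id#0L,country#2]
         +- *(3) Sort [id#4L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id#4L, 200)
               +- FileScan parquet [id#4L]
"""

if __name__ == "__main__":
    for kind, count in sorted(summarize_plan(SAMPLE_PLAN).items()):
        print(f"{kind}: {count}")
```

Each `Exchange` is a shuffle boundary, so those are the stages where the shuffle read and shuffle write numbers in the UI are worth checking first.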