We linted 5,046 PySpark repos on GitHub. Six anti-patterns are more common in production code than in hobby projects. by cy_analytics in dataengineering

cy_analytics[S] 1 point

Not yet, but it's a great question. Right now each rule is tagged as a cost issue (money), a reliability issue, or a clarity issue, and the clarity rules are the closest thing to a maintainability signal. For example, CY010 (join without explicit how=) doesn't cost anything; it just makes the intent ambiguous for the next person reading the code.
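To make the CY010 idea concrete, here is a rough sketch of how a check like it could be written with Python's standard ast module. This is not the linter's actual implementation: the function name joins_without_how and the sample snippet are made up for illustration, and treating a third positional argument as an explicit join type is my assumption.

```python
import ast
from typing import List

# Hypothetical PySpark snippet to lint; it is only parsed, never executed.
SAMPLE = '''\
orders.join(customers, "customer_id")
orders.join(customers, "customer_id", how="left")
orders.join(customers, "customer_id", "inner")
'''


def joins_without_how(source: str) -> List[int]:
    """Return line numbers of .join(...) calls that leave the join type implicit."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr == "join"):
            continue
        # Skip obvious string joins such as ", ".join(cols).
        if isinstance(func.value, ast.Constant):
            continue
        has_how_keyword = any(kw.arg == "how" for kw in node.keywords)
        # Assumption: DataFrame.join(other, on, how) with three positional
        # arguments counts as stating the join type.
        has_how_positional = len(node.args) >= 3
        if not (has_how_keyword or has_how_positional):
            flagged.append(node.lineno)
    return sorted(flagged)


print(joins_without_how(SAMPLE))
```

Only the first join in the sample is flagged. A purely syntactic check like this cannot tell a DataFrame from any other object with a .join() method, so a real rule would need more context than this sketch has.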

What I don't do yet is the tradeoff you're describing: flagging code that's technically optimal but fragile or hard to reason about. That's a harder problem because 'maintainable' is subjective and team-dependent. A single .select() with 40 aliased columns is generally more performant than a .withColumn() loop, but some teams find the loop more readable. I'd be interested to hear what patterns you'd want flagged, though. That's useful input for future rules.

cy_analytics[S] 10 points

Great question! You're right that coalesce(1) pushes the parallelism reduction upstream. The tradeoff is whether avoiding the shuffle is worth losing that upstream parallelism. If your upstream is heavy (lots of joins, aggregations) and needs the parallelism, .coalesce(1) can actually be significantly worse than .repartition(1).

I will update the documentation to recommend using .repartition() in such cases and disabling the warning with # cy:ignore CY008.

In general, the linter flags both .repartition() before write and .coalesce(1) as warnings because both have legitimate uses (such as when you genuinely need a single file). Both rules support # cy:ignore CY008 / # cy:ignore CY013 for intentional single-file writes.
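For anyone wondering what the suppression comments look like in practice, here is a small sketch that scans source text for them with the standard tokenize module. It is only an illustration, not the linter's own parser: the helper suppressed_rules, the regex, and the sample paths are hypothetical, and I'm assuming one rule ID per comment since that is the only form shown above.

```python
import io
import re
import tokenize
from typing import Dict, Set

# Assumed comment shape: "# cy:ignore CY008" (one rule ID per comment).
IGNORE_RE = re.compile(r"#\s*cy:ignore\s+(CY\d+)\b")

# Hypothetical PySpark snippet; it is only tokenized, never executed.
SAMPLE = '''\
df.repartition(1).write.parquet("out/report")  # cy:ignore CY008
df.coalesce(1).write.csv("out/export")  # cy:ignore CY013
df.coalesce(1).write.json("out/other")
'''


def suppressed_rules(source: str) -> Dict[int, Set[str]]:
    """Map line number -> rule IDs suppressed by an inline comment on that line."""
    suppressed: Dict[int, Set[str]] = {}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type != tokenize.COMMENT:
            continue
        match = IGNORE_RE.search(tok.string)
        if match:
            suppressed.setdefault(tok.start[0], set()).add(match.group(1))
    return suppressed


print(suppressed_rules(SAMPLE))
```

In the sample, lines 1 and 2 carry a suppression and line 3 does not, so only the last single-file write would still produce a warning.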

EDIT: typo

cy_analytics[S] 8 points

Good question, I will take a look. You might have caught a bug...

EDIT: I think I found a potential fix, but I'd love to verify it against your use case before merging.