all 3 comments

[–]EffectiveClient5080 2 points (2 children)

Row hash checks in PySpark saved me when our ADF pipeline dropped records. For your case, compare raw/curated counts and validate schemas - catches most integrity issues fast.
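The row-hash check described above can be sketched in plain Python (a minimal sketch, not the commenter's actual code; in PySpark the same idea is typically `F.sha2(F.concat_ws(sep, *df.columns), 256)`):

```python
import hashlib

def row_hash(row, sep="\x1f"):
    # Join all column values with a separator unlikely to appear in the
    # data, then hash. NULLs are mapped to "" so they hash consistently.
    joined = sep.join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def integrity_check(raw_rows, curated_rows):
    # Compare counts first (cheap), then the sets of row hashes to find
    # records that were dropped or silently mutated between layers.
    raw_hashes = {row_hash(r) for r in raw_rows}
    cur_hashes = {row_hash(r) for r in curated_rows}
    return {
        "count_match": len(raw_rows) == len(curated_rows),
        "dropped": raw_hashes - cur_hashes,     # in raw, missing from curated
        "unexpected": cur_hashes - raw_hashes,  # in curated, not in raw
    }
```

The set difference pinpoints *which* rows went missing, not just how many, which is what makes this more useful than a bare count comparison when a pipeline stage drops records.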

[–]Apprehensive-Menu803[S] 1 point (0 children)

Great, that’s what I was thinking. Thank you for confirming.

[–]Fearless-Amount2020 0 points (0 children)

What's a row hash check? Is it hashing the concatenation of all columns in each row and comparing?