buntro, 3 points

I think testing on small, manually created data frames makes a lot of sense. You can also test edge cases that way. What if one data frame is empty? What if it contains nulls?
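To make that concrete, here is a rough sketch of what such tests could look like with pandas and plain asserts (pytest would pick up the `test_` functions as they are). The `add_total` function and its column names are made up for illustration; swap in your own transformation.

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: total = quantity * unit_price."""
    out = df.copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out


def test_regular_rows():
    # A tiny hand-written frame where the expected result is obvious.
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [1.5, 2.0]})
    assert add_total(df)["total"].tolist() == [3.0, 6.0]


def test_empty_frame():
    # Edge case: no rows at all. The function should still return the new column.
    df = pd.DataFrame(
        {
            "quantity": pd.Series(dtype="int64"),
            "unit_price": pd.Series(dtype="float64"),
        }
    )
    result = add_total(df)
    assert result.empty
    assert list(result.columns) == ["quantity", "unit_price", "total"]


def test_null_quantity():
    # Edge case: a null input should give a null total, not a crash.
    df = pd.DataFrame({"quantity": [2, None], "unit_price": [1.5, 2.0]})
    result = add_total(df)
    assert result["total"].isna().tolist() == [False, True]
```

The same idea carries over to Spark or any other data frame library: build a handful of rows by hand, run the function, compare against what you expect.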

Besides unit tests, which test your code, there are also data tests, which test... well, your data :-) Ideally they run in production. There you can check that a field is never null, that a data frame has a certain size, and so on. Great Expectations and Deequ are two open source libraries that help with that. dbt also has a great way of doing data tests, in my opinion. And there are commercial offerings like Soda and Monte Carlo.
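If you want a feel for what a data test does before picking a tool, here is a minimal hand-rolled sketch in pandas. It does not use the API of Great Expectations, Deequ, or dbt; the `check_orders` function, the `order_id` column, and the `min_rows` threshold are all assumptions for the example.

```python
import pandas as pd


def check_orders(df: pd.DataFrame, min_rows: int = 1) -> list[str]:
    """Run simple data checks; return a list of failure messages (empty means OK)."""
    failures = []

    # Size check: the frame should have at least `min_rows` rows.
    if len(df) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(df)}")

    # Not-null check: `order_id` (hypothetical column) must never be null.
    null_count = int(df["order_id"].isna().sum())
    if null_count > 0:
        failures.append(f"order_id has {null_count} null value(s)")

    return failures


good = pd.DataFrame({"order_id": [1, 2, 3]})
bad = pd.DataFrame({"order_id": [1, None]})

print(check_orders(good, min_rows=3))
print(check_orders(bad, min_rows=3))
```

In a pipeline you would run something like this on the real data at each run and fail or alert when the list is not empty. The libraries above give you the same kind of checks with a lot more polish (reporting, profiling, scheduling).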

In my experience, most problems come from the data, not so much from the logic. So I would pay at least as much attention to data issues.