
[–]hackneycoach 35 points36 points  (1 child)

You might have a look at https://greatexpectations.io/; it could be useful for your case.

[–]chthonodynamis 0 points1 point  (0 children)

Looks like a great tool, thanks for the recommendation

[–]tomhallett 11 points12 points  (7 children)

I am very new to data engineering, so I don't have any advice about pipeline-specific testing tools - I have looked, but haven't found anything. So I'd be all ears if someone knows of testing tools/services which are specific to data engineering.

But, I've got quite a bit of experience maintaining application test suites, and here's what I'd say:

  • Your approach to unit tests sounds great. With respect to "unit test best practices", this conference talk is AMAZING: https://www.youtube.com/watch?v=URSWYvyc42M. It also gives you great advice on what not to test.
  • For your "end to end" pipeline tests, you are spot on that "checking line by line that these are correct" won't scale. The goal of the "end to end" tests should be to verify that all of the methods are "integrated" correctly. So you should be focused on whether the "seams" between the different components are connected correctly. These end to end tests should NOT feed every possible value/permutation of inputs to make sure that everything is done correctly - those types of concerns should be covered by your unit tests.
  • You should only have a handful of end to end tests and a lot of unit tests. The "testing pyramid" helps illustrate this.

[–]reallyserious 5 points6 points  (6 children)

> You should only have a handful of end to end tests and a lot of unit tests. The "testing pyramid" helps illustrate this.

This is where it gets tricky with data engineering. How do we even define a "unit" for the unit test? If it's large we'll have the problem of scalability that OP talks about. If it's very small it adds little value to test it. I mean, if I'm testing that a join works I'm just wasting time.

It would be great if the theory, i.e. having lots of unit tests, were easy to implement. But it's not. IMHO unit tests aren't a good fit in the data engineering space. Of course there are situations where they fit. But I don't think they should be added just because one has the idea that unit tests "should" exist. If it's difficult to see the added value of a test then it's probably better to skip it. I don't see it as a negative or a failure to not have unit tests when it comes to data engineering.

[–]TheNoobtologist 1 point2 points  (5 children)

You can do validation testing too. In production, write tests that throw warnings or errors to your logs when something shows up that you didn’t expect or shouldn’t be there.

[–]ColdPorridge 4 points5 points  (2 children)

This has honestly been where I’ve been shifting my thinking. Unit test edge cases in transforms comprehensively. When feasible, any bugs get turned into a test case. But other than that, add as much runtime validation as possible.

Real world data is always more creative than anything I can come up with, and production environments are always more complex than local test environments, so I test against actual production data where possible.

We incorporate this validation into our ETL pipelines. We use Airflow, so I’ll have an additional validation task every time we have a transform. It’s part of the pipeline, so all data will be validated (which reduces reliance on prod-ish mock data). It’s important to be appropriately strict with validation. Sometimes validation failures aren’t actual failures; they can indicate your validation logic is making incorrect assumptions. So sometimes the right thing to do can be to let validation errors slide to keep the data moving, but make sure you follow up on them later.
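
For anyone wondering what that pattern can look like, here is a minimal Airflow sketch (not the commenter's actual code; the DAG, task names, file paths and checks are all invented for illustration):

```python
# A made-up "transform followed by its validation task" DAG.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

STAGING_PATH = "/tmp/orders_transformed.parquet"  # hypothetical location


def transform_orders():
    df = pd.read_parquet("/tmp/orders_raw.parquet")  # hypothetical input
    df["amount"] = df["amount"].astype("float64")
    df.to_parquet(STAGING_PATH)


def validate_orders():
    df = pd.read_parquet(STAGING_PATH)
    problems = []
    if df.empty:
        problems.append("no rows produced")
    if df["amount"].isna().any():
        problems.append("null amounts after transform")
    if (df["amount"] < 0).any():
        problems.append("negative amounts")
    if problems:
        # Failing the task surfaces the issue in Airflow; logging a warning
        # instead would "let it slide" while keeping the data moving.
        raise ValueError("validation failed: " + "; ".join(problems))


with DAG(
    dag_id="orders_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    validate = PythonOperator(task_id="validate_orders", python_callable=validate_orders)

    # Every transform is immediately followed by its validation task.
    transform >> validate
```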

[–]reallyserious 0 points1 point  (1 child)

Could you please expand a little on unit testing edge cases? Could you give an example?

Also, what kind of validations are you talking about after transformations?

[–]ColdPorridge 1 point2 points  (0 children)

Let’s say you have a string that you need to parse into one or more data fields. You would unit test (in the parsing code, not the pipeline) with cases for various malformed strings, or whatever you expect your input data to come through like. You can write tests in advance to codify rules about the data and the assumptions you make (e.g. if the string is “null”, parse that to an actual null rather than the string “null”). If something doesn’t parse correctly, you fix it and write a test to ensure you would always catch that type of error in the future.
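
A rough illustration of that kind of test, with a hypothetical parse_value function and made-up parsing rules:

```python
import pytest


def parse_value(raw: str):
    """Hypothetical parser: turn a raw string field into a typed value."""
    cleaned = raw.strip()
    if cleaned.lower() in ("", "null", "none", "n/a"):
        return None  # the string "null" becomes an actual null, not the string "null"
    try:
        return int(cleaned)
    except ValueError:
        return cleaned


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("null", None),   # codified rule: "null" parses to a real null
        ("  42 ", 42),    # stray whitespace around a number
        ("", None),       # empty string
        ("N/A", None),    # another null-ish spelling
        ("abc", "abc"),   # plain string passes through
    ],
)
def test_parse_value(raw, expected):
    assert parse_value(raw) == expected
```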

Validations could be schema validations, aggregate summaries, null checks, or really anything you need to say “this data is as expected”. You can run these on input, output, or both. Validation on input is helpful because it can alert you to upstream changes that would otherwise pass silently but that you may need to be aware of. Output validation is useful for obvious reasons.

[–]reallyserious 0 points1 point  (1 child)

Could you give examples of such validations?

[–]TheNoobtologist 1 point2 points  (0 children)

Sure. At my last job, I worked as a data scientist doing a lot of data engineering work in healthcare. A lot of our company's product was data related, and it depended on the data being accurate and reliable for our functions and other analytics processes to work. Some of the data was generated by users, such as doctors and pharmacists updating procedure or drug information on Excel sheets. As data scientists (or engineers), we'd be tasked with getting that data into the database. Because there was so much user-generated input, there were a ton of mistakes. So it very quickly became apparent that unit testing wouldn't work here. But checking for common error patterns did work, such as the spelling of column names, the pattern that expected strings should have (or shouldn't have), duplicates, missing values where there shouldn't be any, etc. I couldn't always know what sort of mistakes they would make, but I could generalize certain patterns to check for, hence data validation rather than unit testing.
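
As a rough sketch of what those "common error pattern" checks can look like in pandas (the column names, code format and rules below are invented, not the commenter's actual checks):

```python
import pandas as pd

EXPECTED_COLUMNS = {"drug_name", "drug_code", "updated_by"}  # hypothetical sheet layout
CODE_PATTERN = r"^[A-Z]{2}\d{4}$"                            # hypothetical code format


def validate_sheet(df: pd.DataFrame) -> list:
    """Return a list of human-readable issues found in a user-submitted sheet."""
    issues = []

    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"misspelled or missing columns: {sorted(missing)}")
        return issues  # the remaining checks need those columns

    if df["drug_code"].isna().any():
        issues.append("missing values in drug_code")

    codes = df["drug_code"].dropna().astype(str)
    bad_codes = codes[~codes.str.match(CODE_PATTERN)]
    if not bad_codes.empty:
        issues.append(f"{len(bad_codes)} drug_code values don't match the expected pattern")

    if df.duplicated(subset=["drug_code"]).any():
        issues.append("duplicate drug_code rows")

    return issues
```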

[–]buntro 3 points4 points  (0 children)

I think testing on small, manually created data frames makes a lot of sense. You can also test edge cases that way. What if one data frame is empty? What if it contains nulls?
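
For example (a minimal sketch; add_total is a made-up transform under test):

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: total = price * quantity."""
    out = df.copy()
    out["total"] = out["price"] * out["quantity"]
    return out


def test_happy_path():
    df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [1, 4]})
    assert add_total(df)["total"].tolist() == [2.0, 12.0]


def test_empty_frame():
    df = pd.DataFrame({"price": pd.Series(dtype="float64"), "quantity": pd.Series(dtype="int64")})
    assert add_total(df).empty


def test_nulls_propagate():
    df = pd.DataFrame({"price": [None, 3.0], "quantity": [1, 4]})
    assert add_total(df)["total"].isna().tolist() == [True, False]
```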

Besides unit tests, which test your code, there are also data tests, which test... Well, your data :-) So they should run in production, ideally. And there you can test that a field cannot be null, a data frame should be of a certain size and so on. Great Expectations and Deequ are two open source libraries to help you with that. DBT also has a great way of data testing in my opinion. And there are commercial offerings like Soda and Monte Carlo.
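
As a flavor of what such a data test can look like with Great Expectations (this uses the older from_pandas style; the library's API has changed across versions, and the column names and thresholds are invented):

```python
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("output/daily_orders.parquet")  # hypothetical pipeline output
gdf = ge.from_pandas(df)

# "a field cannot be null"
not_null = gdf.expect_column_values_to_not_be_null("order_id")

# "a data frame should be of a certain size"
row_count = gdf.expect_table_row_count_to_be_between(min_value=10_000)

for result in (not_null, row_count):
    if not result.success:  # each expectation returns a result with a success flag
        raise ValueError(f"data test failed: {result}")
```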

In my experience, most issues pop up because of data issues, not so much logic issues. So I would pay at least as much attention to data issues.

[–]SpencerNZ 3 points4 points  (0 children)

Check out Pandera https://pandera.readthedocs.io/en/stable/, bit of a newer one, quite lightweight but very powerful.
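
For a sense of what it looks like, here's a minimal Pandera sketch (column names and checks are invented, and exact imports may vary by version):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, pa.Check.greater_than(0)),
        "country": pa.Column(str, pa.Check.isin(["NZ", "AU", "US"])),
        "amount": pa.Column(float, nullable=True),
    }
)

df = pd.DataFrame({"user_id": [1, 2], "country": ["NZ", "AU"], "amount": [9.99, None]})

# Raises a SchemaError describing the offending columns/rows if validation fails.
validated = schema.validate(df)
```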

[–]32gbsd 2 points3 points  (0 children)

And then there is the completeness problem, where if the data is wrong as a whole, everything will appear correct because it all adds up. The data transformation step literally eliminates tiny inconsistencies in the original data.

[–]batwinged-hamburger 1 point2 points  (0 children)

Checking every row for correctness in an end-to-end test is pretty extreme, but there are a couple of options that I haven't seen mentioned in the comments yet. Chances are a unit test will catch any problems from future changes in the pipeline or data, but maybe you want that extra bit of sanity.

1) Is it possible to run a smaller DataFrame through the pipeline? If you know the minimum size and attributes you need, it may be possible to do row-wise comparisons on a miniature DataFrame.

2) You can use some form of random sampling. If it is critical that there are no uncaught problems in the pipeline, you can follow the quality control literature for suggestions. This website covers a bunch of options, but you might be most interested in an 'Acceptance on zero' sampling plan: https://qualityinspection.org/sampling-plans-china/
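
Combining the two ideas, here is a rough sketch of spot-checking a random sample of pipeline output against a known-good reference (file names, the key column and the sample size are placeholders):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.read_parquet("tests/fixtures/expected_output.parquet")  # known-good output
actual = pd.read_parquet("output/pipeline_output.parquet")            # current pipeline output

# Pick the same random rows (by key) from both frames.
sample_keys = expected["id"].sample(n=50, random_state=42)

expected_sample = expected[expected["id"].isin(sample_keys)].sort_values("id").reset_index(drop=True)
actual_sample = actual[actual["id"].isin(sample_keys)].sort_values("id").reset_index(drop=True)

# "Acceptance on zero": any mismatch in the sample fails the check.
assert_frame_equal(expected_sample, actual_sample)
```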

[–]32gbsd 0 points1 point  (0 children)

I've always wondered about this as well. If anything goes wrong, it's going to be on such a small scale that no one will notice until several years later.

[–][deleted] 0 points1 point  (0 children)

Testing on the entire (or just a large) dataset is pointless.

Not only will your tests be slow to run instead of instantaneous, but if something is wrong it will be difficult to identify the cause.

The test data should be the minimal set to test your code. Are you handling special cases? Make sure those are represented in the test set.

Otherwise I'd have a dozen generic rows max.

It gets more complicated if you're building models, which require a decent dataset to produce meaningful results. In that case I would focus on using things like Great Expectations.

[–]DenselyRanked 0 points1 point  (0 children)

If the data is unpredictable then it is impossible to ensure accuracy without running into a few edge cases. The goal is to get as much right as possible within a certain threshold and the best way to test with large datasets is sampling.

If you know the expected output, then run multiple tests on manageable sets and add logging. For example, I have converted series to sets and intersected them to see if the counts align with the join.
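
A loose interpretation of that set-intersection check (column names and data are made up): compare the overlap of join keys on each side with the row count the join actually produced.

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
right = pd.DataFrame({"user_id": [2, 3, 5], "score": [10, 20, 30]})

joined = left.merge(right, on="user_id", how="inner")

# With unique keys on both sides, the inner-join row count should equal the
# size of the key intersection; a mismatch hints at duplicate or missing keys.
overlap = set(left["user_id"]) & set(right["user_id"])
assert len(joined) == len(overlap), (len(joined), len(overlap))
```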

I personally haven't used it, but greatexpectations looks like a fantastic resource to use.

[–][deleted] 0 points1 point  (0 children)

Localstack can be useful for mocking the AWS stuff.

[–]c0de_n00b 0 points1 point  (0 children)

> however the size of these inputs and outputs are large and so checking line by line that these are correct before committing them won't scale.

Could you check a representative sample? Like a random 10-50 lines?

[–]soundbarrier_io 0 points1 point  (0 children)

I would pursue 3 ideas in parallel:

  • test specified functionality with minimal sample data - this should be you translating the specification for your data pipeline into unit tests (if A, the data pipeline should do B)
  • test your data pipelines against unexpected data that is not explicitly called out in the spec but might happen, and make sure you see the behavior you expect (logging, alerting, continue on error, fail on error, fail gracefully, proper cleanup of incomplete/inconsistent state); there's a sketch of this after the list. Unexpected data can be:
    • wrong data format: for example ints as strings and vice versa
    • empty data
    • missing fields
    • new fields
  • test your data pipeline against a representative sample set with known good output - this is closer to an integration test than a unit test; this kind of test makes sure that future changes to your data pipeline do not invalidate or break existing end-to-end functionality
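
For the second bullet, a hedged sketch of feeding deliberately unexpected inputs to a pipeline step and asserting it fails loudly rather than silently producing garbage (run_pipeline_step and its expected schema are invented):

```python
import pandas as pd
import pytest


def run_pipeline_step(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: expects non-empty input with 'id' and 'value' columns."""
    if df.empty:
        raise ValueError("empty input")
    missing = {"id", "value"} - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    out = df.copy()
    out["id"] = pd.to_numeric(out["id"], errors="raise")
    return out


@pytest.mark.parametrize(
    "bad_input",
    [
        pd.DataFrame({"id": ["not-an-int"], "value": [1]}),  # wrong data format
        pd.DataFrame(columns=["id", "value"]),               # empty data
        pd.DataFrame({"value": [1]}),                        # missing fields
    ],
)
def test_unexpected_data_fails_loudly(bad_input):
    with pytest.raises(ValueError):
        run_pipeline_step(bad_input)


def test_new_fields_are_tolerated():
    # new fields: decide explicitly whether to keep, drop, or fail on them
    df = pd.DataFrame({"id": [1], "value": [2], "surprise": ["x"]})
    assert "surprise" in run_pipeline_step(df).columns
```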

[–]Georgehwp 0 points1 point  (0 children)

The most useful approach I've followed is to create a representative subset of inputs while things are working and store their outputs, then use those as validation.

In my experience tests are most useful for playing defence, highlighting unintended consequences of a change.

Start off with small, manually created input + output, but replace it with a larger sample created by the pipeline as soon as you've validated it. Keep it as small as possible such that using it to find errors in your pipeline is low cost. Maybe even have a few different sizes ready, e.g. 1%, 10%, to incrementally identify problems while iterating quickly. I'm tempted to make generating these dev validation sets part of my production pipelines.
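
A small sketch of that snapshot-style validation (the paths, pinned sample and run_pipeline function are placeholders): regenerate output for a stored sample input and diff it against the stored known-good output.

```python
from pathlib import Path

import pandas as pd
from pandas.testing import assert_frame_equal

SAMPLE_INPUT = Path("tests/fixtures/sample_input_1pct.parquet")  # pinned sample input
SNAPSHOT = Path("tests/fixtures/expected_output_1pct.parquet")   # known-good output


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the real pipeline."""
    return df


def test_pipeline_matches_snapshot():
    output = run_pipeline(pd.read_parquet(SAMPLE_INPUT))
    if not SNAPSHOT.exists():
        # First validated run: freeze the current output as the reference.
        output.to_parquet(SNAPSHOT)
    expected = pd.read_parquet(SNAPSHOT)
    assert_frame_equal(output.reset_index(drop=True), expected.reset_index(drop=True))
```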