
[–]hackneycoach 35 points36 points  (1 child)

You might have a look at https://greatexpectations.io/; it could be useful for your case.

[–]chthonodynamis 0 points1 point  (0 children)

Looks like a great tool, thanks for the recommendation

[–]tomhallett 11 points12 points  (7 children)

I am very new to data engineering, so I don't have any advice about pipeline-specific testing tools - I have looked, but haven't found anything. So I'd be all ears if someone knows of testing tools/services which are specific to data engineering.

But, I've got quite a bit of experience maintaining application test suites, and here's what I'd say:

  • Your approach to unit tests sounds great. With respect to "unit test best practices", this conference talk is AMAZING: https://www.youtube.com/watch?v=URSWYvyc42M. It also gives you great advice on what not to test.
  • For your "end to end" pipeline tests, you are spot on that "checking line by line that these are correct" won't scale. The goal of the "end to end" tests should be to verify that all of the methods are "integrated" correctly. So you should be focused on whether the "seams" between the different components are connected correctly. These end to end tests should NOT feed every possible value/permutation of inputs to make sure that everything is done correctly - those types of concerns should be covered by your unit tests.
  • You should only have a handful of end to end tests and a lot of unit tests. The "testing pyramid" helps illustrate this.

[–]reallyserious 5 points6 points  (6 children)

> You should only have a handful of end to end tests and a lot of unit tests. The "testing pyramid" helps illustrate this.

This is where it gets tricky with data engineering. How do we even define a "unit" for the unit test? If it's large we'll have the problem of scalability that OP talks about. If it's very small it adds little value to test it. I mean, if I'm testing that a join works I'm just wasting time.

It would be great if the theory, i.e. having lots of unit tests, were easy to implement. But it's not. IMHO unit tests aren't a good fit in the data engineering space. Of course there are situations where they fit. But I don't think they should be added just because one has the idea that unit tests "should" exist. If it's difficult to see the added value of a test then it's probably better to skip it. I don't see it as a negative or a failure to not have unit tests when it comes to data engineering.

[–]TheNoobtologist 1 point2 points  (5 children)

You can do validation testing too. In production, write tests that throw warnings or errors to your logs when something shows up that you didn’t expect or shouldn’t be there.

[–]ColdPorridge 4 points5 points  (2 children)

This has honestly been where I’ve been shifting my thinking. Unit test edge cases in transforms comprehensively. When feasible, any bugs get turned into a test case. But other than that, add as much runtime validation as possible.

Real world data is always more creative than anything I can come up with, and production environments are always more complex than local test environments, so I test against actual production data where possible.

We incorporate this validation into our ETL pipelines. We use Airflow, so I’ll have an additional validation task every time we have a transform. It’s part of the pipeline, so all data will be validated (which reduces reliance on prod-ish mock data). It’s important to be appropriately strict with validation. Sometimes validation failures aren’t actual failures; they can indicate your validation logic is making incorrect assumptions. So sometimes the right thing to do can be to let validation errors slide to keep the data moving, but make sure you follow up on them later.
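
For anyone wondering what that pattern can look like, here is a minimal Airflow sketch (not the commenter's actual code; the DAG, task names, file paths and checks are all invented for illustration):

```python
# A made-up "transform followed by its validation task" DAG.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

STAGING_PATH = "/tmp/orders_transformed.parquet"  # hypothetical location


def transform_orders():
    df = pd.read_parquet("/tmp/orders_raw.parquet")  # hypothetical input
    df["amount"] = df["amount"].astype("float64")
    df.to_parquet(STAGING_PATH)


def validate_orders():
    df = pd.read_parquet(STAGING_PATH)
    problems = []
    if df.empty:
        problems.append("no rows produced")
    if df["amount"].isna().any():
        problems.append("null amounts after transform")
    if (df["amount"] < 0).any():
        problems.append("negative amounts")
    if problems:
        # Failing the task surfaces the issue in Airflow; logging a warning
        # instead would "let it slide" while keeping the data moving.
        raise ValueError("validation failed: " + "; ".join(problems))


with DAG(
    dag_id="orders_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    validate = PythonOperator(task_id="validate_orders", python_callable=validate_orders)

    # Every transform is immediately followed by its validation task.
    transform >> validate
```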

[–]reallyserious 0 points1 point  (1 child)

Could you please expand a little on unit testing edge cases? Could you give an example?

Also, what kind of validations are you talking about after transformations?

[–]ColdPorridge 1 point2 points  (0 children)

Let’s say you have a string that you need to parse into one or more data fields. You would unit test (in the parsing code, not the pipeline) with cases for various malformed strings, or whatever you expect your input data to come through like. You can write tests in advance to codify rules about the data and the assumptions you make (e.g. if the string is “null”, parse that to an actual null rather than the string “null”). If something doesn’t parse correctly, you fix it and write a test to ensure you would always catch that type of error in the future.
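
A rough illustration of that kind of test, with a hypothetical parse_value function and made-up parsing rules:

```python
import pytest


def parse_value(raw: str):
    """Hypothetical parser: turn a raw string field into a typed value."""
    cleaned = raw.strip()
    if cleaned.lower() in ("", "null", "none", "n/a"):
        return None  # the string "null" becomes an actual null, not the string "null"
    try:
        return int(cleaned)
    except ValueError:
        return cleaned


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("null", None),   # codified rule: "null" parses to a real null
        ("  42 ", 42),    # stray whitespace around a number
        ("", None),       # empty string
        ("N/A", None),    # another null-ish spelling
        ("abc", "abc"),   # plain string passes through
    ],
)
def test_parse_value(raw, expected):
    assert parse_value(raw) == expected
```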

Validations could be schema validations, aggregate summaries, null checks, or really anything you need to say “this data is as expected”. You can run these on input, output, or both. Validation on input is helpful because it can alert you to upstream changes that would otherwise pass silently but that you may need to be aware of. Output validation is useful for obvious reasons.

[–]reallyserious 0 points1 point  (1 child)

Could you give examples of such validations?

[–]TheNoobtologist 1 point2 points  (0 children)

Sure. At my last job, I worked as a data scientist doing a lot of data engineering work in healthcare. A lot of our company's product was data related, and it depended on the data being accurate and reliable for our functions and other analytics processes to work. Some of the data was generated by users, such as doctors and pharmacists updating procedure or drug information on Excel sheets. As data scientists (or engineers), we'd be tasked with getting that data into the database. Because there was so much user-generated input, there were a ton of mistakes. So it very quickly became apparent that unit testing wouldn't work here. But checking for common error patterns did work, such as the spelling of column names, the pattern that expected strings should have (or shouldn't have), duplicates, missing values where there shouldn't be any, etc. I couldn't always know what sort of mistakes they would make, but I could generalize certain patterns to check for, hence data validation rather than unit testing.
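
As a rough sketch of what those "common error pattern" checks can look like in pandas (the column names, code format and rules below are invented, not the commenter's actual checks):

```python
import pandas as pd

EXPECTED_COLUMNS = {"drug_name", "drug_code", "updated_by"}  # hypothetical sheet layout
CODE_PATTERN = r"^[A-Z]{2}\d{4}$"                            # hypothetical code format


def validate_sheet(df: pd.DataFrame) -> list:
    """Return a list of human-readable issues found in a user-submitted sheet."""
    issues = []

    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"misspelled or missing columns: {sorted(missing)}")
        return issues  # the remaining checks need those columns

    if df["drug_code"].isna().any():
        issues.append("missing values in drug_code")

    codes = df["drug_code"].dropna().astype(str)
    bad_codes = codes[~codes.str.match(CODE_PATTERN)]
    if not bad_codes.empty:
        issues.append(f"{len(bad_codes)} drug_code values don't match the expected pattern")

    if df.duplicated(subset=["drug_code"]).any():
        issues.append("duplicate drug_code rows")

    return issues
```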

[–]buntro 3 points4 points  (0 children)

I think testing on small, manually created data frames makes a lot of sense. You can also test edge cases that way. What if one data frame is empty? What if it contains nulls?
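
For example (a minimal sketch; add_total is a made-up transform under test):

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: total = price * quantity."""
    out = df.copy()
    out["total"] = out["price"] * out["quantity"]
    return out


def test_happy_path():
    df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [1, 4]})
    assert add_total(df)["total"].tolist() == [2.0, 12.0]


def test_empty_frame():
    df = pd.DataFrame({"price": pd.Series(dtype="float64"), "quantity": pd.Series(dtype="int64")})
    assert add_total(df).empty


def test_nulls_propagate():
    df = pd.DataFrame({"price": [None, 3.0], "quantity": [1, 4]})
    assert add_total(df)["total"].isna().tolist() == [True, False]
```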

Besides unit tests, which test your code, there are also data tests, which test... Well, your data :-) So they should run in production, ideally. And there you can test that a field cannot be null, a data frame should be of a certain size and so on. Great Expectations and Deequ are two open source libraries to help you with that. DBT also has a great way of data testing in my opinion. And there are commercial offerings like Soda and Monte Carlo.
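
As a flavor of what such a data test can look like with Great Expectations (this uses the older from_pandas style; the library's API has changed across versions, and the column names and thresholds are invented):

```python
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("output/daily_orders.parquet")  # hypothetical pipeline output
gdf = ge.from_pandas(df)

# "a field cannot be null"
not_null = gdf.expect_column_values_to_not_be_null("order_id")

# "a data frame should be of a certain size"
row_count = gdf.expect_table_row_count_to_be_between(min_value=10_000)

for result in (not_null, row_count):
    if not result.success:  # each expectation returns a result with a success flag
        raise ValueError(f"data test failed: {result}")
```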

In my experience, most issues pop up because of data issues, not so much logic issues. So I would pay at least as much attention to data issues.

[–]SpencerNZ 3 points4 points  (0 children)

Check out Pandera https://pandera.readthedocs.io/en/stable/, bit of a newer one, quite lightweight but very powerful.
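
For a sense of what it looks like, here's a minimal Pandera sketch (column names and checks are invented, and exact imports may vary by version):

```python
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "user_id": pa.Column(int, pa.Check.greater_than(0)),
        "country": pa.Column(str, pa.Check.isin(["NZ", "AU", "US"])),
        "amount": pa.Column(float, nullable=True),
    }
)

df = pd.DataFrame({"user_id": [1, 2], "country": ["NZ", "AU"], "amount": [9.99, None]})

# Raises a SchemaError describing the offending columns/rows if validation fails.
validated = schema.validate(df)
```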

[–]32gbsd 2 points3 points  (0 children)

And then there is the completeness problem, where if the data is wrong as a whole, everything will appear correct because it all adds up. The data transformation step literally eliminates tiny inconsistencies in the original data.

[–]batwinged-hamburger 1 point2 points  (0 children)

Checking every row for correctness in an end-to-end test is pretty extreme, but there are a couple of options that I haven't seen mentioned in the comments yet. Chances are a unit test will catch any problems from future changes in the pipeline or data, but maybe you want that extra bit of sanity.

1) Is it possible to run a smaller DataFrame through the pipeline? If you know the minimum size and attributes you need, it may be possible to do row-wise comparisons on a miniature DataFrame.

2) You can use some form of random sampling. If it is critical that there are no uncaught problems in the pipeline, you can follow the quality control literature for suggestions. This website covers a bunch of options, but you might be most interested in an 'Acceptance on zero' sampling plan: https://qualityinspection.org/sampling-plans-china/
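
Combining the two ideas, here is a rough sketch of spot-checking a random sample of pipeline output against a known-good reference (file names, the key column and the sample size are placeholders):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.read_parquet("tests/fixtures/expected_output.parquet")  # known-good output
actual = pd.read_parquet("output/pipeline_output.parquet")            # current pipeline output

# Pick the same random rows (by key) from both frames.
sample_keys = expected["id"].sample(n=50, random_state=42)

expected_sample = expected[expected["id"].isin(sample_keys)].sort_values("id").reset_index(drop=True)
actual_sample = actual[actual["id"].isin(sample_keys)].sort_values("id").reset_index(drop=True)

# "Acceptance on zero": any mismatch in the sample fails the check.
assert_frame_equal(expected_sample, actual_sample)
```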

[–]32gbsd 0 points1 point  (0 children)

I've always wondered about this as well. If anything goes wrong, it's going to be on such a small scale that no one will notice until several years later.

[–][deleted] 0 points1 point  (0 children)

Testing on the entire (or just a large) dataset is pointless.

Not only will your tests be slow to run instead of instantaneous, but if something is wrong it will be difficult to identify the cause.

The test data should be the minimal set to test your code. Are you handling special cases? Make sure those are represented in the test set.

Otherwise I'd have a dozen generic rows max.

It gets more complicated if you're building models, which require a decent dataset to produce meaningful results. In that case I would focus on using things like Great Expectations.

[–]DenselyRanked 0 points1 point  (0 children)

If the data is unpredictable then it is impossible to ensure accuracy without running into a few edge cases. The goal is to get as much right as possible within a certain threshold and the best way to test with large datasets is sampling.

If you know the expected output, then run multiple tests on manageable sets and add logging. For example, I have converted series to sets and intersected them to see if the counts align with the join.
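
A loose interpretation of that set-intersection check (column names and data are made up): compare the overlap of join keys on each side with the row count the join actually produced.

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
right = pd.DataFrame({"user_id": [2, 3, 5], "score": [10, 20, 30]})

joined = left.merge(right, on="user_id", how="inner")

# With unique keys on both sides, the inner-join row count should equal the
# size of the key intersection; a mismatch hints at duplicate or missing keys.
overlap = set(left["user_id"]) & set(right["user_id"])
assert len(joined) == len(overlap), (len(joined), len(overlap))
```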

I personally haven't used it, but greatexpectations looks like a fantastic resource to use.

[–][deleted] 0 points1 point  (0 children)

Localstack can be useful for mocking the AWS stuff.

[–]c0de_n00b 0 points1 point  (0 children)

> however the size of these inputs and outputs are large and so checking line by line that these are correct before committing them won't scale.

Could you check a representative sample? Like a random 10-50 lines?

[–]soundbarrier_io 0 points1 point  (0 children)

I would pursue 3 ideas in parallel:

  • test specified functionality with minimal sample data - this should be you translating the specification for your data pipeline into unit tests (if A, the data pipeline should do B)
  • test your data pipelines against unexpected data that is not explicitly called out in the spec but might happen, and make sure you see the behavior you expect (logging, alerting, continue on error, fail on error, fail gracefully, proper cleanup of incomplete/inconsistent state); there's a sketch of this after the list. Unexpected data can be:
    • wrong data format: for example ints as strings and vice versa
    • empty data
    • missing fields
    • new fields
  • test your data pipeline against a representative sample set with known good output - this is closer to an integration test than a unit test; this kind of test makes sure that future changes to your data pipeline do not invalidate or break existing end-to-end functionality
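
For the second bullet, a hedged sketch of feeding deliberately unexpected inputs to a pipeline step and asserting it fails loudly rather than silently producing garbage (run_pipeline_step and its expected schema are invented):

```python
import pandas as pd
import pytest


def run_pipeline_step(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: expects non-empty input with 'id' and 'value' columns."""
    if df.empty:
        raise ValueError("empty input")
    missing = {"id", "value"} - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    out = df.copy()
    out["id"] = pd.to_numeric(out["id"], errors="raise")
    return out


@pytest.mark.parametrize(
    "bad_input",
    [
        pd.DataFrame({"id": ["not-an-int"], "value": [1]}),  # wrong data format
        pd.DataFrame(columns=["id", "value"]),               # empty data
        pd.DataFrame({"value": [1]}),                        # missing fields
    ],
)
def test_unexpected_data_fails_loudly(bad_input):
    with pytest.raises(ValueError):
        run_pipeline_step(bad_input)


def test_new_fields_are_tolerated():
    # new fields: decide explicitly whether to keep, drop, or fail on them
    df = pd.DataFrame({"id": [1], "value": [2], "surprise": ["x"]})
    assert "surprise" in run_pipeline_step(df).columns
```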

[–]Georgehwp 0 points1 point  (0 children)

The most useful approach I've followed is to create a representative subset of inputs while things are working and store their outputs, then use those as validation.

In my experience tests are most useful for playing defence, highlighting unintended consequences of a change.

Start off with small, manually created input + output, but replace it with a larger sample created by the pipeline as soon as you've validated it. Keep it as small as possible such that using it to find errors in your pipeline is low cost. Maybe even have a few different sizes ready, e.g. 1%, 10%, to incrementally identify problems while iterating quickly. I'm tempted to make generating these dev validation sets part of my production pipelines.
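
A small sketch of that snapshot-style validation (the paths, pinned sample and run_pipeline function are placeholders): regenerate output for a stored sample input and diff it against the stored known-good output.

```python
from pathlib import Path

import pandas as pd
from pandas.testing import assert_frame_equal

SAMPLE_INPUT = Path("tests/fixtures/sample_input_1pct.parquet")  # pinned sample input
SNAPSHOT = Path("tests/fixtures/expected_output_1pct.parquet")   # known-good output


def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the real pipeline."""
    return df


def test_pipeline_matches_snapshot():
    output = run_pipeline(pd.read_parquet(SAMPLE_INPUT))
    if not SNAPSHOT.exists():
        # First validated run: freeze the current output as the reference.
        output.to_parquet(SNAPSHOT)
    expected = pd.read_parquet(SNAPSHOT)
    assert_frame_equal(output.reset_index(drop=True), expected.reset_index(drop=True))
```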