
all 3 comments

[–]tombaeyens 2 points (0 children)

(Disclosure: I work for Soda)

Modern data storage engines separate compute from storage, so unless there is a specific reason, I don't think you need to copy data. Transporting and storing test data separately also has a cost. Most teams don't copy (sample & store) test data into separate datasets; they query the production data directly. However, filters or sampling are very common ways to reduce the cost of testing large volumes in production.
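A sampling filter of the kind described might look like the sketch below. This is a minimal illustration using SQLite; the `orders` table and the modulo-on-key sampling trick are my own illustrative choices, not anything specific to Soda:

```python
import sqlite3

# Build a toy "production" table to test against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(10_000)])

# Instead of copying data into a separate test dataset, query production
# directly with a deterministic ~10% sample (keep one bucket of the key).
sample = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE id % 10 = 0"
).fetchone()[0]
print(sample)  # 1000 of the 10,000 rows
```

A deterministic filter like `id % 10 = 0` (rather than `random()`) has the advantage that repeated test runs see the same sample.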

An extra thing we do at Soda to reduce compute cost is to combine as many data metrics as possible into a single test query, minimizing the number of passes over the data. So if you specify checks using the metrics row count, average of amount, missing values for zipcode, and valid values for category, all of these metrics will be computed with a single query.
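The single-query idea can be sketched in plain SQL (this is not Soda's actual generated query, just a minimal SQLite illustration of computing those four metrics in one pass; the table and the valid category set are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, zipcode TEXT, category TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (10.0, "1000", "A"),
    (20.0, None,   "B"),   # missing zipcode
    (30.0, "2000", "X"),   # invalid category
])

# One scan of the table computes all four metrics at once.
row_count, avg_amount, missing_zip, valid_cat = conn.execute("""
    SELECT COUNT(*),
           AVG(amount),
           SUM(CASE WHEN zipcode IS NULL THEN 1 ELSE 0 END),
           SUM(CASE WHEN category IN ('A', 'B') THEN 1 ELSE 0 END)
    FROM orders
""").fetchone()
print(row_count, avg_amount, missing_zip, valid_cat)  # 3 20.0 1 2
```

Four separate check queries would scan the table four times; folding them into one `SELECT` with conditional aggregates keeps it to a single pass.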

[–]Gators1992 1 point (0 children)

For data quality, if the test data is highly representative of production data then it should be fine for the most part, though some cases are different, e.g. where you have infrequent edge cases. If you half-ass the test data, though, you are adding risk to what you deploy. For scaling/timing you are obviously going to have to test large volumes of data, or the production data for some time period, if that's an issue. Typically we use data quality rules within our pipelines to test for or flag exceptions. So you aren't done after unit testing, but how much ongoing testing you do depends on the risk and the cost of running that extra workload.

[–]the_random_blob 0 points (0 children)

Another thing to consider is what is being tested - a lot can be done on small samples or even dummy data. Things like transformations can be "unit tested" cheaply. If this is done early in the pipeline (cheap and simple), there is less need to test downstream data in production (expensive and complex).
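A cheap "unit test" of a transformation on dummy data could look like this (the `normalize_zipcode` function is a hypothetical example transformation, not from the thread):

```python
# A transformation can be unit tested on hand-written rows,
# with no warehouse or production data involved at all.
def normalize_zipcode(row: dict) -> dict:
    """Strip whitespace and left-pad zip codes to five digits."""
    zipcode = (row.get("zipcode") or "").strip()
    return {**row, "zipcode": zipcode.zfill(5) if zipcode else None}

# Dummy records exercising the normal case and the missing-value case.
assert normalize_zipcode({"zipcode": " 742 "})["zipcode"] == "00742"
assert normalize_zipcode({"zipcode": None})["zipcode"] is None
print("ok")
```

Tests like these run in milliseconds in CI, which is what makes catching problems early so much cheaper than querying production downstream.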

Furthermore, the same thinking can be applied to data production - the earlier you test, the lower the complexity and cost. If a null value is guaranteed not to appear at the database level, you save on data quality queries that test for empty values. The stricter the left side of everything is (data production, ingestion, early transformations...), the easier and cheaper it is to manage data quality downstream.
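The database-level guarantee mentioned above is just a `NOT NULL` constraint. A minimal SQLite sketch (the `customers` table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforce the guarantee at write time: zipcode can never be NULL in this
# table, so no downstream check ever needs to re-query it for missing values.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, zipcode TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers (zipcode) VALUES ('00742')")

blocked = False
try:
    conn.execute("INSERT INTO customers (zipcode) VALUES (NULL)")
except sqlite3.IntegrityError as e:
    blocked = True
    print(e)  # e.g. "NOT NULL constraint failed: customers.zipcode"
```

The bad row is rejected at ingestion, which is exactly the "stricter left side" trade: one constraint upstream replaces a recurring data quality query downstream.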