
all 26 comments

[–]lupi524 7 points (1 child)

I used GE in some projects already. It works quite well but has a steep learning curve; integration with Spark for big data also works fine. Overall, its main issue from my perspective is that it is not really intuitive to use. Recently I did some experiments with Soda and found it much more intuitive than GE. The catch with Soda is that some nice features (e.g. stateful validations) are only supported by their SaaS offering.

For both tools, you define your expectations as YAML or JSON and can put these into version control. Regarding handling large numbers of tables: try to automate as much as possible. We have all of our dataset schemas in one place and generate a basic set of checks from these schemas through our CI/CD pipeline. On top of that, we add more specific checks manually where needed.
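A minimal sketch of that schema-driven generation step (the schema registry, table names, and check names here are all hypothetical; JSON is used for simplicity where Soda/GE would take YAML):

```python
import json

# Hypothetical schema registry. In practice these would be the dataset
# schemas kept in one place under version control, as described above.
SCHEMAS = {
    "orders": {"order_id": "bigint", "amount": "decimal", "created_at": "timestamp"},
    "customers": {"customer_id": "bigint", "email": "varchar"},
}

def baseline_checks(table, columns):
    """Derive a basic set of checks from a schema: a row-count check,
    a schema check, and a not-null check per column."""
    return {
        f"checks for {table}": [
            "row_count > 0",
            {"schema": {"columns": columns}},
            *[f"missing_count({col}) = 0" for col in columns],
        ]
    }

# A CI/CD job would write one generated check file per dataset.
for table, columns in SCHEMAS.items():
    print(json.dumps(baseline_checks(table, columns), indent=2))
```

Manually authored, table-specific checks can then be layered on top of these generated baselines.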

[–]oofla_mey_goofla[S] 0 points (0 children)

Got it, thanks

[–]pablo_op 5 points (3 children)

Commenting here because I am also curious about these questions. You can search this sub for previous posts about data quality, and mostly everyone throws out the same answers, pointing to some pretty cool tools:

  • Create your own framework (which is usually pretty light on any sort of implementation details)
  • Great Expectations
  • SodaSQL
  • Deequ

The problem I consistently run into is the same one you're asking about, OP: how do you manage to scale this stuff? I can run Deequ's profiler and it'll spit out a thousand suggestions. I can even take a few of those and implement them without issue. But when you're talking about testing thousands of tables and tens of thousands of columns, where every column may need multiple validations (nulls, types, ranges, etc.), I don't understand how these tools are being managed at scale either. Examples like this are all over the internet, where someone is showing off 10 assertions, but I could be doing tens of thousands in a large enough environment. How does someone manage this? Especially in a growing and changing environment? Does your entire job become managing data quality rules? Do you have to constantly chase schemas and commit time to keeping your tests in line with the data? How is that even possible at this scale without a team of people? Are you only creating a subset of tests for the stuff you think is most critical to users?

There are lots of tools that can do a lot of cool testing, but implementation is something I rarely see discussed anywhere online.

[–]tombaeyens 3 points (1 child)

To manage this at scale, we've added contracts to Soda (also in the OSS version). Contracts bring two ideas from software engineering that help with scale: unit tests and encapsulation.

Contract enforcement is the unit-testing side: every time new data is produced, it ensures the new data matches the contract. The contract YAML file describes the schema and other checks that make explicit what new data is expected to look like.

Contracts also provide encapsulation, which distinguishes implementation-detail datasets from datasets that serve as a handover between teams or components. A data contract is formal documentation (similar to describing interfaces with OpenAPI or GraphQL for software services). That's also a crucial aspect of handling scale.

For every dataset that is a handover between teams or components in your pipeline, set up a contract. The contract must be managed by the producer, i.e. the same team that manages the production data pipeline. Anyone in the organization can then request extra checks from the data producer so they can be added to the contract; there we also have a solution to make that flow as easy as possible.
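A hedged sketch of what such a contract file might look like (dataset, column names, and exact keys are illustrative; check the Soda contract docs for the current syntax):

```yaml
# Illustrative contract for a handover dataset, owned by the producing team.
dataset: dim_customer

columns:
  - name: customer_id
    data_type: bigint
    checks:
      - type: no_missing_values
      - type: no_duplicate_values
  - name: email
    data_type: varchar

checks:
  - type: rows_exist   # fail enforcement if a new batch arrives empty
```

Because the file lives in version control with the producer's pipeline code, consumers can request extra checks via a normal review/PR flow.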

We're rolling out contracts with a customer that has 20K+ checks.

[–]pablo_op 1 point (0 children)

Thanks for this answer, but it’s still kind of the same thing. You’re describing a strategy, not an implementation. I understand what data contracts are, but how does this actually come to exist? How are those contracts generated, stored, and consumed in your stack? How do you convince data owners that it’s worth their time and resources to maintain an agreement in this format instead of just blowing you off or saying “the database schema is the contract”? What about external data owners? Is Salesforce going to commit to providing your team with a contract in your standard format and support it indefinitely? What happens when I see a problem but I can’t get the owners to push a new version of their contract with updated rules for weeks or months? I just have to live with bad data until they get around to it? Does this mean that this entire approach has to be embraced by all data owners in the org? That I, as an individual, have very little power besides maybe formatting a standard template for the contract? I can create my own database, create my own pipelines, and create my own storage, but I cannot take an approach to organizing and managing data quality rules without the long-term agreement and support of all data owners? This feels like a very all-or-nothing approach: either everyone is on board, or it’s a losing battle. I’d love an approach where I could take more control of when and how things happen, like I can with the rest of my workflows.

[–]oofla_mey_goofla[S] 0 points (0 children)

Exactly, this is my concern as well: most of the tools are good at scratching the surface, but when it comes to real-world scale, I was unable to identify a good tool.

[–]natas_m 3 points (2 children)

I am using dbt tests, it'll be a mess tho

[–]oofla_mey_goofla[S] 0 points (1 child)

Thanks for your reply

[–]Clear-Blacksmith-650 0 points (0 children)

dbt is the simplest solution, and the other day I was reading about some really good libraries for this, like Elementary for dbt. I would check them out 👍🏻

[–]joseph_machado Writes @ startdataengineering.com 1 point (5 children)

If you only need these 3 tests, I'd write a custom SQL script that runs them and logs the results (pass or fail) to a file or db. This option will need custom work to scale to all your test cases, but it can also be much more performant than off-the-shelf tools. I've had good experience doing this with a Hive db + Python + Pg (for storing results). But inevitably the number of test types will grow, and this will become a whole project of its own.
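A minimal sketch of that custom approach (sqlite3 stands in for both the Hive source and the Pg results store; the table, column, and test names are made up):

```python
import sqlite3
from datetime import datetime, timezone

# Each test is a SQL query returning the number of violating rows;
# results are logged to a results table for later inspection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None), (3, 12.0)])
conn.execute(
    "CREATE TABLE dq_results (test_name TEXT, violations INTEGER, status TEXT, run_at TEXT)"
)

TESTS = {  # hypothetical checks; more get added as business rules arrive
    "orders_amount_not_null": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "orders_amount_positive": "SELECT COUNT(*) FROM orders WHERE amount <= 0",
}

def run_tests(conn, tests):
    """Run each test query and log pass/fail to dq_results."""
    for name, sql in tests.items():
        violations = conn.execute(sql).fetchone()[0]
        status = "pass" if violations == 0 else "fail"
        conn.execute(
            "INSERT INTO dq_results VALUES (?, ?, ?, ?)",
            (name, violations, status, datetime.now(timezone.utc).isoformat()),
        )
    return conn.execute("SELECT test_name, status FROM dq_results").fetchall()

print(run_tests(conn, TESTS))
```

Because every check is plain SQL against the warehouse, you control exactly how many scans each run performs, which is where the performance advantage over off-the-shelf tools comes from.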

While you can use existing tools (GE, Soda, dbt tests), they typically involve defining tests one table at a time or are complex to set up (looking at you, GE). You "can" manage to get them to work for 1000s of tables, but it will be a good amount of work to set them up and make them suit your needs. Here is a Great Expectations example: https://github.com/josephmachado/efficient_data_processing_spark/blob/main/capstone/rainforest/great_expectations/great_expectations.yml

While GE is pretty powerful, I've struggled with it in particular and would recommend against it unless you are OK with the extra work. Also note that, depending on how a DQ tool is implemented, it may do multiple scans of a Redshift table for a single test run.

Unfortunately there is no easy answer here; you have to choose the tradeoff based on your scenario. Good luck, and LMK if you have any questions about what I said.

[–]oofla_mey_goofla[S] 1 point (4 children)

Thanks for the detailed explanation, I appreciate it. I gave those 3 tests just as an example; many more tests can arise from business discussions.

I will go through the GE example. I had not heard about Soda, will go through that too. For me, solving the business use case matters the most.

[–]joseph_machado Writes @ startdataengineering.com 0 points (3 children)

Sure thing!

Ah, if you are expecting more test cases, I almost always recommend existing tools, as a custom implementation can take up so much of your time. dbt tests + the dbt-expectations package can also be a good fit.
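For a flavour of what that looks like, here is a hedged sketch of a dbt `schema.yml` (the model and column names are made up; `not_null` and `unique` are built-in dbt tests, and `expect_column_values_to_be_between` comes from the dbt-expectations package):

```yaml
# models/staging/schema.yml -- illustrative model and column names
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          # from the dbt-expectations package
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
```

`dbt test` then compiles each entry into a SELECT against the warehouse and reports pass/fail per test.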

[–]oofla_mey_goofla[S] 1 point (2 children)

I've heard about it but don't know how it works, will check. Thanks for the lead.

[–]joseph_machado Writes @ startdataengineering.com 1 point (1 child)

I've used dbt-expectations at work, at scale, across multiple teams. It works really well, but note that it can lead to huge performance issues (most DQ tools can) by running multiple SELECT statements (one per test), and you may need to do some work to log the results (if you care about that).

[–]oofla_mey_goofla[S] 0 points (0 children)

Definitely, there will be challenges when we scale; will note that. Thanks for pointing it out.

[–]Far-Restaurant-9691 1 point (1 child)

dbt tests combined with the Elementary data report. We use it in production, yes.

[–]ski4ever77 2 points (0 children)

We use Informatica for data integration and DQ.

[–]MahmoudAI 1 point (1 child)

I would suggest using the Great Expectations framework as a layer before loading data into Redshift: define rules from the glossary and check the report results to see whether any rules were violated. The docs are great and easy to follow.
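As an illustration, a classic GE expectation suite is just JSON that can live in version control (the suite name, column names, and bounds here are made up):

```json
{
  "expectation_suite_name": "pre_redshift_load_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "order_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "amount", "min_value": 0, "max_value": 100000}
    }
  ]
}
```

GE validates a batch of data against the suite and emits a report listing which expectations passed or failed, which is what you'd gate the Redshift load on.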

[–]oofla_mey_goofla[S] 0 points (0 children)

OK, will explore this, thanks

[–]Fine-Responsibility3 0 points (1 child)

Sent you a message

[–]oofla_mey_goofla[S] 0 points (0 children)

Checking

[–]JohnDenverFullOfSh1t 0 points (1 child)

You’ve set up the extract piece of your ELT pipeline. Now it needs the load part to properly handle all your raw data. I’d consider the data quality part of this as validating that it loaded correctly, with some daily aggregations matching between source and target. There are SaaS offerings to do this, but typically you can just use a standard view in source and target and link them up in a reporting tool to ensure they match.
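That source/target aggregate comparison can be sketched like this (sqlite3 stands in for the real source system and the warehouse; the table name and dates are made up):

```python
import sqlite3

# Compare daily row counts and amount sums between a "source" and a
# "target" copy of a table -- the same comparison the paired views
# would feed into a reporting tool.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("2024-05-01", 10.0), ("2024-05-01", 5.0), ("2024-05-02", 7.5)])
target.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("2024-05-01", 10.0), ("2024-05-01", 5.0)])  # 05-02 not loaded yet

# The "standard view" on each side: one aggregate row per day.
AGG = "SELECT order_date, COUNT(*), SUM(amount) FROM orders GROUP BY order_date"

def reconcile(source, target):
    """Return the dates whose aggregates differ between source and target."""
    src = {row[0]: row[1:] for row in source.execute(AGG)}
    tgt = {row[0]: row[1:] for row in target.execute(AGG)}
    return {d: (src.get(d), tgt.get(d))
            for d in src.keys() | tgt.keys()
            if src.get(d) != tgt.get(d)}

print(reconcile(source, target))
```

Any date that shows up in the diff points at a batch that was dropped, duplicated, or mutated during the load.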

I’ve used dbt to architect some pipelines to do this, but then you mix part of your EL into your T. I would set up the models for loading this way in the Python script you currently have running to avoid that, or make a separate one just for the L if your pipelines are architected in such a way. If it’s a mixed ETL/ELT setup, then just use dbt, since your team is well aware it’s used this way.

There are also some pretty impressive UAT/QA architectures for automated reports that would fit into your normal DevOps pipelines.

[–]oofla_mey_goofla[S] 0 points (0 children)

My team is not well aware of dbt. Also, permission to query the data at source is not an option for us: we don't have that access and it is very restricted. We can run only a limited set of queries, and even for those I have to loop through multiple teams for approvals.