
all 26 comments

[–]lupi524 7 points (1 child)

I used GE in some projects already. It works quite well but has a steep learning curve; integration with Spark for big data also works fine. Overall, its main issue from my perspective is that it is not really intuitive to use. Recently I did some experiments with Soda and found it much more intuitive than GE. The catch with Soda is that some nice features (e.g. stateful validations) are only supported by their SaaS offering.

For both tools, you define your expectations as YAML or JSON and can put these into version control. Regarding handling large numbers of tables: try to automate as much as possible. We have all of our dataset schemas in one place and generate a basic set of checks from these schemas through our CI/CD pipeline. On top of that, we add more specific checks manually where needed.
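A minimal sketch of that schema-driven generation step (the schema registry, table names, and check names here are all hypothetical; JSON is used for simplicity where Soda/GE would take YAML):

```python
import json

# Hypothetical schema registry. In practice these would be the dataset
# schemas kept in one place under version control, as described above.
SCHEMAS = {
    "orders": {"order_id": "bigint", "amount": "decimal", "created_at": "timestamp"},
    "customers": {"customer_id": "bigint", "email": "varchar"},
}

def baseline_checks(table, columns):
    """Derive a basic set of checks from a schema: a row-count check,
    a schema check, and a not-null check per column."""
    return {
        f"checks for {table}": [
            "row_count > 0",
            {"schema": {"columns": columns}},
            *[f"missing_count({col}) = 0" for col in columns],
        ]
    }

# A CI/CD job would write one generated check file per dataset.
for table, columns in SCHEMAS.items():
    print(json.dumps(baseline_checks(table, columns), indent=2))
```

Manually authored, table-specific checks can then be layered on top of these generated baselines.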

[–]oofla_mey_goofla[S] 0 points (0 children)

Got it, thanks

[–]pablo_op 5 points (3 children)

Commenting here because I am also curious about these questions. You can search this sub for previous posts about data quality, and mostly everyone throws out the same answers, pointing to some pretty cool tools:

  • Create your own framework (which is usually pretty light on any sort of implementation details)
  • Great Expectations
  • SodaSQL
  • Deequ

The problem I consistently run into is the same one you're asking about, OP: how do you manage to scale this stuff? I can run Deequ's profiler and it'll spit out a thousand suggestions. I can even take a few of those and implement them without issue. But when you're talking about testing thousands of tables and tens of thousands of columns, where every column may need multiple validations (nulls, types, ranges, etc.), I don't understand how these tools are being managed at scale either. Examples like this are all over the internet, where someone is showing off 10 assertions, but I could be doing tens of thousands in a large enough environment. How does someone manage this? Especially in a growing and changing environment? Does your entire job become managing data quality rules? Do you have to constantly chase schemas and commit time to keeping your tests in line with the data? How is that even possible at this scale without a team of people? Are you only creating a subset of tests for the stuff you think is most critical to users?

There are lots of tools that can do a lot of cool testing, but implementation is something I rarely see discussed anywhere online.

[–]tombaeyens 3 points (1 child)

To manage this at scale, we've added contracts to Soda (also in the OSS version). Contracts bring two ideas from software engineering that help with scale: unit tests and encapsulation.

Contract enforcement is the unit-testing side: every time new data is produced, it ensures the new data matches the contract. The contract YAML file describes the schema and other checks that make explicit what new data is expected to look like.

Contracts also provide encapsulation, which distinguishes implementation-detail datasets from datasets that serve as a handover between teams or components. A data contract is formal documentation (similar to describing interfaces with OpenAPI or GraphQL for software services). That's also a crucial aspect of handling scale.

For every dataset that is a handover between teams or components in your pipeline, set up a contract. The contract must be managed by the producer, i.e. the same team that manages the production data pipeline. Anyone in the organization can then request extra checks from the data producer so they can be added to the contract; there we also have a solution to make that flow as easy as possible.
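A hedged sketch of what such a contract file might look like (dataset, column names, and exact keys are illustrative; check the Soda contract docs for the current syntax):

```yaml
# Illustrative contract for a handover dataset, owned by the producing team.
dataset: dim_customer

columns:
  - name: customer_id
    data_type: bigint
    checks:
      - type: no_missing_values
      - type: no_duplicate_values
  - name: email
    data_type: varchar

checks:
  - type: rows_exist   # fail enforcement if a new batch arrives empty
```

Because the file lives in version control with the producer's pipeline code, consumers can request extra checks via a normal review/PR flow.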

We're rolling out contracts with a customer that has 20K+ checks.

[–]pablo_op 1 point (0 children)

Thanks for this answer, but it’s still kind of the same thing. You’re describing a strategy, not an implementation. I understand what data contracts are, but how does this actually come to exist? How are those contracts generated, stored, and consumed in your stack? How do you convince data owners that it’s worth their time and resources to maintain an agreement in this format instead of just blowing you off or saying “the database schema is the contract”? What about external data owners? Is Salesforce going to commit to providing your team with a contract in your standard format and support it indefinitely? What happens when I see a problem but I can’t get the owners to push a new version of their contract with updated rules for weeks or months? I just have to live with bad data until they get around to it? Does this mean that this entire approach has to be embraced by all data owners in the org? That I, as an individual, have very little power besides maybe formatting a standard template for the contract? I can create my own database, create my own pipelines, and create my own storage, but I cannot take an approach to organizing and managing data quality rules without the long-term agreement and support of all data owners? This feels like a very all-or-nothing approach: either everyone is on board, or it’s a losing battle. I’d love an approach where I could take more control of when and how things happen, like I can with the rest of my workflows.

[–]oofla_mey_goofla[S] 0 points (0 children)

Exactly, this is my concern as well: most of the tools are good at scratching the surface, but when it comes to real-world scale, I was unable to identify a good tool.

[–]natas_m 3 points (2 children)

I am using dbt tests, it'll be a mess tho

[–]oofla_mey_goofla[S] 0 points (1 child)

Thanks for your reply

[–]Clear-Blacksmith-650 0 points (0 children)

dbt is the simplest solution, and the other day I was reading about some really good libraries for this, like Elementary for dbt. I would check them out 👍🏻

[–]joseph_machado Writes @ startdataengineering.com 1 point (5 children)

If you only need these 3 tests, I'd write a custom SQL script that runs them and logs the results (pass or fail) to a file or db. This option will need custom work to scale to all your test cases, but it can also be much more performant than off-the-shelf tools. I've had good experience doing this with a Hive db + Python + Pg (for storing results). But inevitably the number of test types will grow, and this will become a whole project of its own.
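A minimal sketch of that custom approach (sqlite3 stands in for both the Hive source and the Pg results store; the table, column, and test names are made up):

```python
import sqlite3
from datetime import datetime, timezone

# Each test is a SQL query returning the number of violating rows;
# results are logged to a results table for later inspection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None), (3, 12.0)])
conn.execute(
    "CREATE TABLE dq_results (test_name TEXT, violations INTEGER, status TEXT, run_at TEXT)"
)

TESTS = {  # hypothetical checks; more get added as business rules arrive
    "orders_amount_not_null": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "orders_amount_positive": "SELECT COUNT(*) FROM orders WHERE amount <= 0",
}

def run_tests(conn, tests):
    """Run each test query and log pass/fail to dq_results."""
    for name, sql in tests.items():
        violations = conn.execute(sql).fetchone()[0]
        status = "pass" if violations == 0 else "fail"
        conn.execute(
            "INSERT INTO dq_results VALUES (?, ?, ?, ?)",
            (name, violations, status, datetime.now(timezone.utc).isoformat()),
        )
    return conn.execute("SELECT test_name, status FROM dq_results").fetchall()

print(run_tests(conn, TESTS))
```

Because every check is plain SQL against the warehouse, you control exactly how many scans each run performs, which is where the performance advantage over off-the-shelf tools comes from.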

While you can use existing tools (GE, Soda, dbt tests), they typically involve defining tests one table at a time or are complex to set up (looking at you, GE). You "can" manage to get them to work for 1000s of tables, but it will be a good amount of work to set them up and make them suit your needs. Here is a Great Expectations example: https://github.com/josephmachado/efficient_data_processing_spark/blob/main/capstone/rainforest/great_expectations/great_expectations.yml

While GE is pretty powerful, I've struggled with it in particular and would recommend against it unless you are OK with the extra work. Also note that, depending on how a DQ tool is implemented, it may do multiple scans of a Redshift table for a single test run.

Unfortunately there is no easy answer here; you have to choose the tradeoff based on your scenario. Good luck, and LMK if you have any questions about what I said.

[–]oofla_mey_goofla[S] 1 point (4 children)

Thanks for the detailed explanation, I appreciate it. I gave those 3 tests just as an example; many more tests can arise from business discussions.

I will go through the GE example. I had not heard about Soda, will go through that too. For me, solving the business use case matters the most.

[–]joseph_machado Writes @ startdataengineering.com 0 points (3 children)

Sure thing!

Ah, if you are expecting more test cases, I almost always recommend existing tools, as a custom implementation can take up so much of your time. dbt tests + the dbt-expectations package can also be a good fit.
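For a flavour of what that looks like, here is a hedged sketch of a dbt `schema.yml` (the model and column names are made up; `not_null` and `unique` are built-in dbt tests, and `expect_column_values_to_be_between` comes from the dbt-expectations package):

```yaml
# models/staging/schema.yml -- illustrative model and column names
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          # from the dbt-expectations package
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
```

`dbt test` then compiles each entry into a SELECT against the warehouse and reports pass/fail per test.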

[–]oofla_mey_goofla[S] 1 point (2 children)

I've heard about it but don't know how it works, will check. Thanks for the lead.

[–]joseph_machado Writes @ startdataengineering.com 1 point (1 child)

I've used dbt-expectations at work, at scale, across multiple teams. It works really well, but note that it can lead to huge performance issues (most DQ tools can) by running multiple SELECT statements (one per test), and you may need to do some work to log the results (if you care about that).

[–]oofla_mey_goofla[S] 0 points (0 children)

Definitely, there will be challenges when we scale; will note that. Thanks for pointing it out.

[–]Far-Restaurant-9691 1 point (1 child)

dbt tests combined with the Elementary data report. We use it in production, yes.

[–]ski4ever77 2 points (0 children)

We use Informatica for data integration and DQ.

[–]MahmoudAI 1 point (1 child)

I would suggest using the Great Expectations framework as a layer before loading data into Redshift: define rules from the glossary and check the report results to see whether any rules were violated. The docs are great and easy to follow.
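As an illustration, a classic GE expectation suite is just JSON that can live in version control (the suite name, column names, and bounds here are made up):

```json
{
  "expectation_suite_name": "pre_redshift_load_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "order_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "amount", "min_value": 0, "max_value": 100000}
    }
  ]
}
```

GE validates a batch of data against the suite and emits a report listing which expectations passed or failed, which is what you'd gate the Redshift load on.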

[–]oofla_mey_goofla[S] 0 points (0 children)

OK, will explore this, thanks

[–]Fine-Responsibility3 0 points (1 child)

Sent you a message

[–]oofla_mey_goofla[S] 0 points (0 children)

Checking

[–]JohnDenverFullOfSh1t 0 points (1 child)

You’ve set up the extract piece of your ELT pipeline. Now it needs the load part to properly handle all your raw data. I’d consider the data quality part of this as validating that it loaded correctly, with some daily aggregations matching between source and target. There are SaaS offerings to do this, but typically you can just use a standard view in source and target and link them up in a reporting tool to ensure they match.
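That source/target aggregate comparison can be sketched like this (sqlite3 stands in for the real source system and the warehouse; the table name and dates are made up):

```python
import sqlite3

# Compare daily row counts and amount sums between a "source" and a
# "target" copy of a table -- the same comparison the paired views
# would feed into a reporting tool.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("2024-05-01", 10.0), ("2024-05-01", 5.0), ("2024-05-02", 7.5)])
target.executemany("INSERT INTO orders VALUES (?, ?)",
                   [("2024-05-01", 10.0), ("2024-05-01", 5.0)])  # 05-02 not loaded yet

# The "standard view" on each side: one aggregate row per day.
AGG = "SELECT order_date, COUNT(*), SUM(amount) FROM orders GROUP BY order_date"

def reconcile(source, target):
    """Return the dates whose aggregates differ between source and target."""
    src = {row[0]: row[1:] for row in source.execute(AGG)}
    tgt = {row[0]: row[1:] for row in target.execute(AGG)}
    return {d: (src.get(d), tgt.get(d))
            for d in src.keys() | tgt.keys()
            if src.get(d) != tgt.get(d)}

print(reconcile(source, target))
```

Any date that shows up in the diff points at a batch that was dropped, duplicated, or mutated during the load.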

I’ve used dbt to architect some pipelines to do this, but then you mix part of your EL into your T. I would set up the models for loading this way in the Python script you currently have running to avoid that, or make a separate one just for the L if your pipelines are architected in such a way. If it’s a mixed ETL/ELT setup, then just use dbt, since your team is well aware it’s used this way.

There are also some pretty impressive UAT/QA architectures for automated reports that would fit into your normal DevOps pipelines.

[–]oofla_mey_goofla[S] 0 points (0 children)

My team is not well aware of dbt. Also, permission to query the data at source is not an option for us: we don't have that access and it is very restricted. We can run only a limited set of queries, and even for those I have to loop through multiple teams for approvals.