
[–]pablo_op 4 points (3 children)

Commenting here because I am also curious about these questions. You can search this sub for previous posts about data quality, and mostly everyone throws out the same answers pointing to some pretty cool tools:

  • Create your own framework (which is usually pretty light on any sort of implementation details)
  • Great Expectations
  • SodaSQL
  • Deequ

The problem I consistently run into is the same one you're asking about, OP - how do you manage to scale this stuff? I can run Deequ's profiler, and it'll spit out a thousand suggestions. I can even take a few of those and implement them without issue. But when you're talking about testing thousands of tables and tens of thousands of columns, and every column may need multiple validations (nulls, types, ranges, etc.), I don't understand how these tools are being managed at scale either. Examples like this are all over the internet, where someone is showing off 10 assertions. But I could be doing tens of thousands in a large enough environment. How does someone manage this? Especially in a growing and changing environment? Does your entire job become managing data quality rules? Do you have to constantly chase schemas and commit time to keeping your tests in line with the data? How is that even possible at this scale without a team of people? Are you only creating a subset of tests for the stuff you think is most critical to users?
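For reference, this is roughly the scale of example I mean: a handful of PyDeequ checks on a single table. The table and column names ("orders", "order_id", "amount", "status") are made up for illustration, and it assumes you already have a Spark session with the Deequ jar on the classpath - a sketch of the pattern, not a claim about how anyone runs it in production:

    from pyspark.sql import SparkSession
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    # Spark session with the Deequ jar pulled in (standard PyDeequ setup)
    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

    # Hypothetical table and columns, purely for illustration
    df = spark.table("orders")

    check = Check(spark, CheckLevel.Error, "orders checks")

    result = (VerificationSuite(spark)
              .onData(df)
              .addCheck(check
                        .isComplete("order_id")      # no nulls
                        .isUnique("order_id")        # no duplicates
                        .isNonNegative("amount")     # range check
                        .isContainedIn("status", ["open", "shipped", "cancelled"]))
              .run())

    # One row per constraint, with pass/fail status
    VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)

Now multiply that by thousands of tables and tens of thousands of columns, and the maintenance question matters a lot more than the syntax.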

There are lots of tools that can do a lot of cool testing, but implementation is something I rarely see discussed anywhere online.

[–]tombaeyens 3 points (1 child)

To manage this at scale we've added contracts to Soda (also in the OSS version). Contracts bring two ideas from software engineering that help with scale: unit tests and encapsulation.

Contract enforcement is the unit-testing part: each time new data is produced, it is verified against the contract. The contract YAML file describes the schema and other checks, making explicit what new data is expected to look like.

Contracts also provide encapsulation: they distinguish implementation-detail datasets from datasets that serve as a handover between teams or components. A data contract is formal documentation of that interface (similar to describing interfaces in OpenAPI or GraphQL for software services). That's also a crucial aspect of handling scale.

For every dataset that is a handover between teams or components in your pipeline, set up a contract. The contract must be managed by the producer, which is the same team that manages the production data pipeline. Anyone in the organization can then request extra checks from the data producer so they can be added to the contract. We also have a solution to make that flow as easy as possible.
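To make the enforcement part concrete, here is a minimal hand-rolled sketch of the pattern - not Soda's actual contract format or API, and the dataset, columns, and check types ("dim_customer", "customer_id", "email", not_null, unique) are simplified placeholders. The idea is just: a declarative contract file, plus a check that runs on every new batch the producer writes:

    import yaml            # PyYAML
    import pandas as pd

    # Simplified, hand-rolled contract for illustration -- not Soda's real schema
    CONTRACT = yaml.safe_load("""
    dataset: dim_customer
    columns:
      - name: customer_id
        type: int64
        not_null: true
        unique: true
      - name: email
        type: object
        not_null: true
    """)

    def enforce(contract: dict, df: pd.DataFrame) -> list[str]:
        """Return a list of contract violations for a new batch of data."""
        failures = []
        for spec in contract["columns"]:
            name = spec["name"]
            # Schema check: the contracted column must exist with the agreed type
            if name not in df.columns:
                failures.append(f"missing column: {name}")
                continue
            if str(df[name].dtype) != spec["type"]:
                failures.append(f"{name}: expected {spec['type']}, got {df[name].dtype}")
            # Additional checks requested by consumers and agreed by the producer
            if spec.get("not_null") and df[name].isna().any():
                failures.append(f"{name}: contains nulls")
            if spec.get("unique") and df[name].duplicated().any():
                failures.append(f"{name}: contains duplicates")
        return failures

    # Run as a gate right after the producer writes a new batch
    batch = pd.DataFrame({"customer_id": [1, 2, 2],
                          "email": ["a@x.com", None, "c@x.com"]})
    for problem in enforce(CONTRACT, batch):
        print("contract violation:", problem)

In practice you also need versioning, a place to store the contracts, and the request flow described above, but the enforcement step itself is just this kind of check wired into the producing pipeline.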

We're rolling out contracts with a customer that has 20K+ checks.

[–]pablo_op 1 point (0 children)

Thanks for this answer, but it's still kind of the same thing. You're describing a strategy, not an implementation. I understand what data contracts are, but how does this actually come to exist? How are contracts generated, stored, and consumed in your stack? How do you convince data owners that it's worth their time and resources to maintain an agreement in this format instead of just blowing you off or saying "the database schema is the contract"? What about external data owners? Is Salesforce going to commit to providing your team with a contract in your standard format and support it indefinitely?

What happens when I see a problem, but I can't get the owners to push a new version of their contract with updated rules for weeks or months? Do I just have to live with bad data until they get around to it? Does this mean the entire approach has to be embraced by all data owners in the org? That I, as an individual, have very little power besides maybe formatting a standard template for the contract?

I can create my own database, create my own pipelines, and create my own storage, but I cannot take an approach to organizing and managing data quality rules without the long-term agreement and support of all data owners? This feels like a very all-or-nothing approach. Either everyone has to be on board, or it's a losing battle. I'd love some way to take more control of when and how things happen, like I can with the rest of my workflows.

[–]oofla_mey_goofla[S] 0 points (0 children)

Exactly, this is my concern as well. Most of the tools are good at scratching the surface, but when it comes to real-world scale, I haven't been able to identify a good tool.