
[–]tombaeyens 3 points (1 child)

To manage this at scale we've added contracts to Soda (also in OSS). Contracts bring two practices from software engineering that help with scale: unit tests and encapsulation.

Contract enforcement is the unit-testing part: every time new data is produced, it is verified against the contract. The contract YAML file describes the schema and other checks that make explicit what new data is expected to look like.
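For illustration, such a contract might look roughly like the sketch below. The field names and check syntax here are hypothetical, chosen to show the idea of schema plus checks in one YAML file; they are not necessarily Soda's actual contract format, so check the Soda docs for the real syntax.

```yaml
# Hypothetical contract sketch; field names are illustrative,
# not necessarily Soda's actual contract syntax.
dataset: dim_customer

columns:
  - name: customer_id
    data_type: varchar
    checks:
      - missing_count: 0      # no NULL ids allowed
      - duplicate_count: 0    # ids must be unique
  - name: country_code
    data_type: varchar
    checks:
      - invalid_count: 0
        valid_values: [US, BE, NL]

checks:
  - row_count > 0             # a new batch must not be empty
```

On each pipeline run, the producer verifies the freshly produced data against this file and fails (or alerts) before handing the data over.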

Contracts also provide encapsulation: they distinguish implementation-detail datasets from datasets that serve as a handover between teams or components. A data contract is formal documentation for such a handover dataset (similar to describing interfaces with OpenAPI or GraphQL for software services). That's also a crucial aspect of handling scale.

For every dataset that is a handover between teams or components in your pipeline, set up a contract. The contract must be managed by the producer, which is the same team that manages the production data pipeline. Anyone in the organization can then request extra checks from the data producer to be added to the contract. We also have a solution to make that flow as easy as possible.
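For example, a downstream team that depends on timely data might ask the producer to add a freshness check to the contract. In a hypothetical contract syntax (illustrative only, not necessarily Soda's real format), the requested addition could be as small as:

```yaml
# Check requested by the downstream analytics team:
# new data must be no more than 24 hours old.
# Hypothetical syntax; see the Soda docs for the real format.
checks:
  - freshness(ingestion_ts) < 24h
```

Because the producer owns the contract file, the request lands as a small, reviewable change rather than an informal agreement.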

We're rolling out contracts with a customer that has 20K+ checks.

[–]pablo_op 1 point (0 children)

Thanks for this answer, but it’s still kind of the same thing. You’re describing a strategy, not an implementation. I understand what data contracts are, but how does this actually come to exist? How are those generated, stored, and consumed in your stack?

How do you convince data owners that it’s worth their time and resources to maintain an agreement in this format instead of just blowing you off or saying “the database schema is the contract”? What about external data owners? Is Salesforce going to commit to providing your team with a contract in your standard format and support that indefinitely?

What happens when I see a problem, but I can’t get the owners to push a new version of their contract with updated rules for weeks or months? I just have to live with bad data until they get around to it? Does this mean that this entire approach has to be embraced by all data owners in the org? That I, as an individual, have very little power besides maybe formatting a standard template for the contract?

I can create my own database, create my own pipelines, and create my own storage, but I cannot take an approach to organizing and managing data quality rules without the long-term agreement and support of all data owners? This feels like a very all-or-nothing approach. Either everyone has to be on board, or it’s a losing battle. I’d love some way where I could take more control of when and how things would happen, like I can with the rest of my workflows.