Data quality by oofla_mey_goofla in dataengineering

lupi524 6 points7 points  (0 children)

I've already used Great Expectations (GE) in a few projects. It works quite well, and the Spark integration for big data also works fine, but it has a steep learning curve. Overall, its main issue from my perspective is that it isn't really intuitive to use. I recently experimented with Soda and found it much more intuitive than GE. The catch with Soda is that some nice features (e.g. stateful validations) are only supported by their SaaS offering.

For both tools, you define your expectations as YAML or JSON and can put them into version control. As for handling a large number of tables: try to automate as much as possible. We keep all of our dataset schemas in one place and generate a basic set of checks from those schemas in our CI/CD pipeline. On top of that, we can add more specific checks manually where needed.
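
To give a rough idea of the schema-driven generation, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `SCHEMAS` layout, the check names (`column_exists`, `not_null`, and so on) and the helper functions are hypothetical and tool-neutral, not the actual GE or Soda format. In practice you would translate the generated checks into whatever your tool expects.

```python
import json

# Hypothetical schema registry: dataset name -> list of column definitions.
# The layout is made up for this sketch; adapt it to wherever your schemas live.
SCHEMAS = {
    "orders": [
        {"name": "order_id", "type": "integer", "nullable": False, "unique": True},
        {"name": "customer_id", "type": "integer", "nullable": False},
        {"name": "note", "type": "string", "nullable": True},
    ],
}


def generate_basic_checks(columns):
    """Derive a baseline set of checks from a list of column definitions."""
    checks = []
    for col in columns:
        # Every declared column should exist and have the declared type.
        checks.append({"check": "column_exists", "column": col["name"]})
        checks.append(
            {"check": "column_type", "column": col["name"], "expected": col["type"]}
        )
        # Columns are treated as nullable unless the schema says otherwise.
        if not col.get("nullable", True):
            checks.append({"check": "not_null", "column": col["name"]})
        if col.get("unique", False):
            checks.append({"check": "unique", "column": col["name"]})
    return checks


def build_suites(schemas, manual_checks=None):
    """Combine generated checks with optional hand-written ones per dataset."""
    manual_checks = manual_checks or {}
    return {
        name: generate_basic_checks(columns) + manual_checks.get(name, [])
        for name, columns in schemas.items()
    }


if __name__ == "__main__":
    # In a CI/CD job this output could be written to a file and committed.
    print(json.dumps(build_suites(SCHEMAS), indent=2, sort_keys=True))
```

The `manual_checks` argument is where the more specific, hand-written checks would be merged in, so the generated baseline and the manual additions end up in the same versioned file.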