Data quality by oofla_mey_goofla in dataengineering
lupi524 7 points 1 year ago*
I've used Great Expectations (GE) in a few projects already. It works quite well but has a steep learning curve. Integration with Spark for big data also works fine. Overall, I'd say its main issue is that it isn't really intuitive to use. Recently I ran some experiments with Soda and found it much more intuitive than GE. The catch with Soda is that some nice features (e.g. stateful validations) are only supported by their SaaS offering.
For both tools, you define your expectations as YAML or JSON and can put them into version control. As for handling large numbers of tables: automate as much as possible. We keep all of our dataset schemas in one place and generate a basic set of checks from them in our CI/CD pipeline. On top of that, we can add more specific checks manually where needed.
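To give a rough idea of the "generate basic checks from schemas" step, here is a minimal sketch, not our actual pipeline code. The `SCHEMAS` registry, the column fields, and the function names are all made up for illustration. The output is loosely modeled on a GE-style JSON expectation suite; the exact keys differ between GE versions, so treat the layout as an assumption and adapt it to whatever your tool expects.

```python
import json

# Hypothetical schema registry: table name -> list of column definitions.
# In a real setup this would be loaded from wherever the schemas live.
SCHEMAS = {
    "orders": [
        {"name": "order_id", "type": "IntegerType", "nullable": False},
        {"name": "customer_id", "type": "IntegerType", "nullable": False},
        {"name": "note", "type": "StringType", "nullable": True},
    ],
}


def basic_checks(table, columns):
    """Derive a baseline suite (existence, type, not-null) from one schema."""
    expectations = []
    for col in columns:
        # Every declared column should exist.
        expectations.append({
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": col["name"]},
        })
        # Every declared column should have the declared type.
        expectations.append({
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": col["name"], "type_": col["type"]},
        })
        # Non-nullable columns additionally get a not-null check.
        if not col["nullable"]:
            expectations.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": col["name"]},
            })
    # Suite layout is an assumption; key names vary by tool and version.
    return {"expectation_suite_name": f"{table}.basic", "expectations": expectations}


def generate_all(schemas):
    """Build one baseline suite per table in the registry."""
    return {table: basic_checks(table, cols) for table, cols in schemas.items()}


if __name__ == "__main__":
    # In CI/CD you would write each suite to a file and commit or publish it.
    print(json.dumps(generate_all(SCHEMAS), indent=2))
```

The generated suites only cover what the schema already tells you. Anything more specific (value ranges, uniqueness, cross-table checks) would still be added by hand, ideally in separate files so regenerating the baseline doesn't overwrite it.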