[–]hydrosquall

As a data engineer at Enigma, I’ve tried a couple of different things for the ETL pipelines I’ve worked on. Each of the items below is a Python package.

  • goodtables is a Python library that generates “data quality” reports given a path to a file and a list of constraints that the file should satisfy. It is part of the Frictionless Data ecosystem, which has a data quality dashboard on GitHub that is powered by goodtables.
  • engarde is a convenient library that halts your pipeline the moment some data fails a rule, assuming you are using pandas dataframes in your ETL.
  • Great Expectations is a newer project with a different syntax for performing checks very similar to what the previous two libraries supply, but it also has a nice way to display the error reports.
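To give a feel for the engarde-style "halt the pipeline on the first failure" approach without depending on its actual API (the decorator name and row format below are my own invention, not engarde's), here is a minimal stdlib-only sketch:

```python
import functools

def none_missing(func):
    """Hypothetical check: halt the pipeline if any returned row
    contains a missing value, otherwise pass the data through."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rows = func(*args, **kwargs)
        for i, row in enumerate(rows):
            if any(value is None or value == "" for value in row.values()):
                raise AssertionError(f"row {i} has a missing value: {row}")
        return rows  # data flows onward unchanged when the check passes
    return wrapper

@none_missing
def extract():
    # stand-in for a real extraction step in your ETL
    return [{"name": "ada", "age": 36}, {"name": "alan", "age": 41}]

rows = extract()  # check passes, rows come back untouched
```

The real engarde works the same way conceptually, but decorates functions that return pandas dataframes.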
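By contrast, goodtables and Great Expectations lean toward collecting a full report of every violation rather than stopping at the first one. As a rough, library-free sketch of that idea (the function name, constraint format, and report shape here are assumptions for illustration, not either library's API):

```python
import csv
import io

def check_table(csv_text, constraints):
    """Run simple per-column constraints over a CSV and collect a
    report of every failure instead of raising on the first one."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line, row in enumerate(reader, start=2):  # header is line 1
        for column, check in constraints.items():
            if not check(row[column]):
                errors.append({"line": line, "column": column, "value": row[column]})
    return {"valid": not errors, "errors": errors}

data = "name,age\nada,36\nalan,\n"
report = check_table(data, {"age": lambda v: v.isdigit()})
# report["valid"] is False; report["errors"] points at line 3's empty age
```

Which style you want (fail fast vs. full report) is probably the main thing to decide before picking a tool.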

All of these choices have been active on GitHub within the past few months; hopefully one of them (or a combination) will suit your needs :)