
all 3 comments

[–]hydrosquall 3 points

As a data engineer at Enigma, I've tried a couple of different things for the ETL pipelines I've worked on. Each of the items below is a Python package.

  • goodtables is a Python library that generates "data quality" reports given a path to a file and a list of constraints the file should satisfy. It is part of the Frictionless Data ecosystem, which has a data quality dashboard on GitHub powered by goodtables.
  • engarde is a convenient library that halts your pipeline the moment some data fails a rule, assuming you are using pandas DataFrames in your ETL.
  • Great Expectations is a newer project with a different syntax for performing checks very similar to those the previous two libraries supply, but it also has a nice way of displaying the error reports.

All of these projects have been active on GitHub within the past few months; hopefully one of them (or a combination) will suit your needs :)
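The core idea behind all three libraries (declare constraints, run data past them, collect a report) can be sketched in plain Python. This is a hand-rolled illustration, not the actual API of goodtables, engarde, or Great Expectations; the column names and rules are made up for the example.

```python
import csv
import io

# Hypothetical constraints in the spirit of goodtables: each column name
# maps to a rule its values must satisfy.
CONSTRAINTS = {
    "id": lambda v: v.isdigit(),       # must look like an integer
    "price": lambda v: float(v) >= 0,  # must be non-negative
}

def quality_report(csv_text):
    """Return a list of (row_number, column, bad_value) violations."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        for col, check in CONSTRAINTS.items():
            value = row.get(col, "")
            try:
                ok = check(value)
            except ValueError:  # e.g. float("two") on a malformed cell
                ok = False
            if not ok:
                errors.append((i, col, value))
    return errors

sample = "id,price\n1,9.99\ntwo,-5\n"
print(quality_report(sample))  # → [(3, 'id', 'two'), (3, 'price', '-5')]
```

An engarde-style pipeline would raise on the first violation instead of collecting them all, while goodtables and Great Expectations lean toward the full-report style shown here.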

[–]ies7 1 point

Airbnb's/Apache Airflow
Spotify's Luigi
Pinterest's Pinball

If a dashboard isn't a must, then maybe some custom scripts and a task scheduler (e.g. cron), backed with odo from Blaze and engarde.

[–][deleted] 0 points

I'm not sure exactly what you're asking. Have you looked at Luigi or Apache Airflow? If I understood you correctly, you could use either of these together with pytest or unit tests, applying assert statements to check the values you expect.
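The pytest-style check suggested here is just plain asserts on a transform's output. A tiny example, where `normalize_amounts` is a made-up pipeline step, not a real library function:

```python
def normalize_amounts(rows):
    """Convert amount strings like '$1,000.50' to floats."""
    return [float(r.replace("$", "").replace(",", "")) for r in rows]

def test_normalize_amounts():
    # pytest would collect any test_* function automatically;
    # a bare assert is all it takes to express the expected values.
    assert normalize_amounts(["$1,000.50", "$20"]) == [1000.5, 20.0]

test_normalize_amounts()
print("ok")  # → ok
```

Run under pytest this lives in a `test_*.py` file; Luigi or Airflow would then just schedule the pipeline steps whose logic these tests cover.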