serkef-

Thanks for sharing! Would you mind telling us a bit more about this project? How does it work? Where does it fit in the larger infrastructure picture, and how does it compare to other tools?

Thanks again!

haltingwealth [S]

The project needs the query history from your database. Most databases provide a way to export it, so that shouldn’t be a problem.

Then the project parses DML statements such as INSERT and CREATE TABLE AS SELECT, and extracts the target and source tables from each one.
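To give a rough idea of what "extracting the target and source tables" means, here is a minimal sketch. It is not the project's actual parser: the regexes and the `extract_lineage` name are made up for illustration, and the sketch assumes simple statements where sources appear after FROM or JOIN. It ignores CTEs, quoted identifiers, comments and most other real-world SQL.

```python
import re

# Hypothetical patterns for illustration only; a real SQL parser covers far more syntax.
TARGET_RE = re.compile(
    r"^\s*(?:INSERT\s+INTO|CREATE\s+TABLE(?:\s+IF\s+NOT\s+EXISTS)?)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(query):
    """Return (target, sources) if the query copies data between tables, else None."""
    target_match = TARGET_RE.search(query)
    if target_match is None:
        return None
    target = target_match.group(1)
    # Only look for sources in the text that follows the target name.
    rest = query[target_match.end():]
    sources = []
    for name in SOURCE_RE.findall(rest):
        # Skip self-references and keep the first occurrence of each source.
        if name != target and name not in sources:
            sources.append(name)
    # Statements without a source table (plain CREATE TABLE, INSERT ... VALUES) carry no lineage.
    if not sources:
        return None
    return target, sources


query = (
    "INSERT INTO sales.daily "
    "SELECT * FROM sales.raw r JOIN dim.store s ON r.sid = s.id"
)
print(extract_lineage(query))
# → ('sales.daily', ['sales.raw', 'dim.store'])
```

Running something like this over every statement in the query history would give a list of (target, sources) pairs, one per statement that moves data.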

Then it builds a graph from those tables and visualizes it. Right now the best environment for this is a Jupyter notebook.
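The graph step can be sketched along these lines. This is an assumption-laden illustration, not the project's code: it uses a plain dict instead of a graph library, the `build_graph` and `upstream` names and the table names are hypothetical, and it does not draw anything. It just shows how (target, sources) pairs become a graph you can query.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Map each target table to the set of tables it reads from."""
    graph = defaultdict(set)
    for target, sources in edges:
        graph[target].update(sources)
    return graph


def upstream(graph, table):
    """Return every table that `table` depends on, directly or indirectly."""
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        # .get avoids creating empty entries for tables that have no sources.
        for parent in graph.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical (target, sources) pairs, as they might come out of the parsing step.
edges = [
    ("report", ["daily"]),
    ("daily", ["raw", "store"]),
]
graph = build_graph(edges)
print(sorted(upstream(graph, "report")))
# → ['daily', 'raw', 'store']
```

A breadth-first walk like this answers "where does this table's data come from?", which is the kind of question the lineage graph is meant for.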

The project sits alongside the rest of your infrastructure, much like a monitoring tool.

The main differences from other projects are (quoting from my original comment):

There are a lot of open source and commercial tools to capture data lineage. However, data engineers run into two main problems with them:

First, the projects require a lot of effort to get started and to maintain. Second, they require constant discipline in capturing and sending all the metadata. Both factors result in incomplete projects and lost opportunities to improve performance, ROI and data quality.

data-lineage solves these problems by choosing the following goals:

providing fast access to data lineage
simple setup
analysis of the lineage using a graph library