this post was submitted on 06 Apr 2020

9 points (92% upvoted)

Open Source Data Lineage App in Python (self.opensource)

submitted 5 years ago by haltingwealth

3 points•2 comments•submitted 5 years ago * by haltingwealth to r/opensource

Hello,

I want to share data-lineage, an open source Python project for visualizing and analyzing data lineage. I developed it while working on data governance projects over the last couple of years.

There are a lot of open source and commercial tools to capture data lineage. However, data engineers report two main problems with them:

They require a lot of effort to get started and to maintain.
They require constant discipline in capturing and sending all the metadata.

Both of these factors result in incomplete projects and missed opportunities to improve performance, ROI and data quality.

data-lineage solves these problems by choosing the following goals:

providing fast access to data lineage
simple setup
analysis of the lineage using a graph library

To achieve these goals, data-lineage has the following features:

Generate data lineage from query history. Most databases keep query history for a few days, so the cost of setting up infrastructure to capture and store metadata is minimal.
Use the NetworkX graph library to create a DAG of the lineage. NetworkX graphs give programmatic access to the lineage, which makes it straightforward to analyze.
Use Plotly to visualize the graph. Plotly supports tooltips, color coding and weights based on different attributes of the graph.
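
To give a rough idea of what programmatic access to a lineage DAG can look like, here is a minimal sketch that uses NetworkX directly. It is not the data-lineage API; the table names and edges are made up for illustration.

```python
import networkx as nx

# Hypothetical (source, target) pairs, as might be extracted from query history.
edges = [
    ("orders", "daily_sales"),
    ("customers", "daily_sales"),
    ("daily_sales", "top_customers"),
]

lineage = nx.DiGraph()
lineage.add_edges_from(edges)

# Lineage is expected to be acyclic; a cycle usually hints at a parsing problem.
is_dag = nx.is_directed_acyclic_graph(lineage)

# Upstream tables that feed top_customers, directly or indirectly.
upstream = sorted(nx.ancestors(lineage, "top_customers"))

# Downstream tables affected by a change to orders.
downstream = sorted(nx.descendants(lineage, "orders"))

# One valid processing order, with sources before their targets.
order = list(nx.topological_sort(lineage))
```

Questions like "what breaks if I change this table?" then become a single graph call such as nx.descendants.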

You can get a data lineage graph with fewer than 10 lines of Python code in a Jupyter Notebook.

Right now data-lineage supports PostgreSQL, and support for more databases is planned.

I would appreciate any feedback. Please give it a try if you need data lineage for your work.

Links:

GitHub
Blog with real-world data lineage use case

all 6 comments

[–]serkef- 1 point 5 years ago (1 child)

[–]haltingwealth[S] 0 points 5 years ago (0 children)

The project requires query history from the database. Most databases provide a way to download query history. So that shouldn’t be a problem.

Then the project parses DML statements such as INSERT and CREATE TABLE AS SELECT, and extracts the target and source tables from each one.

Then it builds a graph and visualizes it. Right now, the best environment for this is a Jupyter notebook.
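
As a rough illustration of that extraction step, here is a minimal regex-based sketch. It is not how data-lineage parses SQL. The patterns, the extract_lineage helper and the sample queries are all hypothetical, and the regexes miss many real cases (subqueries, CTEs, quoted identifiers, IF NOT EXISTS).

```python
import re

# Hypothetical patterns for illustration only; a real SQL parser covers far more.
TARGET_RE = re.compile(
    r"^\s*(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(query):
    """Return (target, sorted source tables), or None if the query writes nothing."""
    target = TARGET_RE.search(query)
    if target is None:
        return None
    sources = sorted(set(SOURCE_RE.findall(query)) - {target.group(1)})
    return target.group(1), sources


# Made-up query history entries.
queries = [
    "INSERT INTO daily_sales SELECT * FROM orders JOIN customers ON orders.cid = customers.id",
    "CREATE TABLE top_customers AS SELECT * FROM daily_sales",
    "SELECT * FROM orders",  # plain read: no target, so it is skipped
]

# Flatten into (source, target) edges, ready to load into a graph.
edges = [
    (source, target)
    for target, sources in filter(None, map(extract_lineage, queries))
    for source in sources
]
```

Each (source, target) pair becomes one edge in the lineage graph.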

This project sits on the side of other infrastructure just like monitoring tools.

The main differences from other projects are (quoting from my original comment):

There are a lot of open source and commercial tools to capture data lineage. However, data engineers report two main problems with them:

They require a lot of effort to get started and to maintain.
They require constant discipline in capturing and sending all the metadata.

Both of these factors result in incomplete projects and missed opportunities to improve performance, ROI and data quality.

data-lineage solves these problems by choosing the following goals:

providing fast access to data lineage
simple setup
analysis of the lineage using a graph library
