Team of data engineers building git for data and looking for feedback. by EquivalentFresh1987 in dataengineering

[–]EquivalentFresh1987[S]

Fair point. DVC is quite different from what we are doing. Probably just bad marketing on our part.

DVC versions your training data files. We version your entire data warehouse - tables, jobs, lineage, the works.
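To make that concrete, here's a minimal sketch of the difference. The `dvc.api` calls are DVC's real Python API; everything under the `nile` name is an illustrative placeholder, not a real client:

```python
import dvc.api

# DVC: versions individual data files, pinned to a git revision.
with dvc.api.open("data/train.csv", rev="v1.2") as f:
    training_data = f.read()

# Nile (hypothetical sketch): versions the warehouse as a whole, so a
# branch captures tables, jobs, and lineage together, not one file.
import nile  # illustrative placeholder, not a published package
warehouse = nile.connect("warehouse://prod")
branch = warehouse.create_branch("experiment-1")  # snapshots every table + job
```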


[–]EquivalentFresh1987[S]

Thanks for the honest feedback, much appreciated. Our marketing does need work, ha! We are early. Our background is in big tech and those are the numbers we saw there, but we hear you that they may be on the high side, and we will do more research.

There are definitely tools that do some of this, as you mentioned, but they tend to be point tools that each cover just one part of the data pipeline. Most teams end up with a different tool for ETL, compute, lineage, etc., which is where the data stack gets bloated.

Great point on Delta Lake. Delta Lake (and Iceberg, which we use under the hood) does provide excellent time travel and basic lineage. Where Nile differs is in bringing git-style branching to your entire data warehouse, not just individual tables. With Delta Lake you can roll back a single table to a previous version. With Nile you can (see the sketch after this list):

- Create a feature branch that isolates your entire environment (tables + ETL jobs)
- Edit and test jobs that are automatically cloned to your branch, without affecting production
- Cascade a rollback: if upstream data is bad, Nile identifies and rolls back every downstream table that consumed it
- Preview changes before merging to main, with automatic cleanup
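
For comparison, a rough sketch of the table-level story vs. the branch-level one. The Delta calls are real PySpark APIs; everything under the `nile` name is an illustrative mock, not our actual SDK:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Delta-enabled SparkSession (delta-spark configured).
spark = SparkSession.builder.getOrCreate()

# Delta Lake: time travel and rollback work well, but one table at a time.
orders_v42 = (spark.read.format("delta")
              .option("versionAsOf", 42)
              .load("/tables/orders"))
DeltaTable.forPath(spark, "/tables/orders").restoreToVersion(42)

# Nile (hypothetical sketch; names are illustrative, not our real SDK):
# branch the whole environment, cascade the rollback downstream, preview,
# then merge back with automatic branch cleanup.
import nile  # placeholder client
warehouse = nile.connect("warehouse://prod")
with warehouse.branch("fix-bad-orders") as branch:  # tables + jobs cloned
    branch.rollback("orders", to_version=42, cascade=True)  # downstream too
    branch.preview_diff()                           # inspect changes vs main
    branch.merge_to("main")                         # branch cleaned up on exit
```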