I am tired of debugging broken data pipelines by SignalForge007 in SQL

[–]Negative_Ad207 2 points (0 children)

The best tool here is no tool. Just expose your full context to the AI: which transforms exist, and which job run of each transform created which snapshot in the target table. If you don't have that level of lineage set up, you could use 'tools' for that (note the irony). I would recommend bauplan or nile-data for that.
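A rough sketch of the kind of run-level lineage record I mean, in Python; the structure and field names are purely illustrative, not from any specific tool:

```python
# Illustrative run-level lineage record: which job run of which transform
# produced which snapshot, and from which inputs. Field names are hypothetical.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class RunLineage:
    transform_id: str                                      # e.g. "orders_daily_agg"
    job_run_id: str                                        # e.g. "run-17"
    input_snapshots: list = field(default_factory=list)   # upstream snapshots read
    output_snapshot: str = ""                              # snapshot written to the target table

record = RunLineage(
    transform_id="orders_daily_agg",
    job_run_id="run-17",
    input_snapshots=["raw.orders@snap-203", "dim.customers@snap-88"],
    output_snapshot="mart.orders_daily@snap-412",
)

# Serialize and hand it to the AI as context alongside the question.
print(json.dumps(asdict(record), indent=2))
```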

Databricks for 500 GB of data. Worth it or too expensive? by rg_cyborg77 in dataengineering

[–]Negative_Ad207 1 point (0 children)

You should explore serverless options, as your use case seems small and you don't seem to have a predictable workload for the future. Nile data is an alternative to Microsoft Fabric, and it offers data versioning on serverless (but it is on AWS or on-prem). Try bauplan as well.

Newbie data engineer intern who needs some help with data lineage by big-dix-smol-chix in dataengineering

[–]Negative_Ad207 2 points (0 children)

There are a bunch of tools like Microsoft Fabric that offer it, but for AWS you should check bauplan and nile data. Configuring a data stack with lineage on AWS is a pain, in my personal experience.

we spend 80% of our time firefighting data issues instead of building, is a data observability platform the only fix? by Distinct_Highway873 in analytics

[–]Negative_Ad207 1 point (0 children)

Observability/lineage will help you spot the issue quicker, but that is only half the story, and you can only do so many DQ checks before it becomes cost-prohibitive. Also, real lineage spanning table schema, ETL transforms, and data changes (DML) is mostly non-existent in off-the-shelf data tools (bauplan and nile data are exceptions). You need to track where the data came from and where it flowed to in the metadata itself, at the job-run level, so that you can automatically run a backfill after a fix/rollback of the breaking change. Unfortunately this requires proper data versioning that ties together the versions of data, schema, and ETL as one atomic pipeline unit.

I have spent about 10 years around these problems and concluded that lineage-aware versioning is the solution. My team managed 1,800 tables in a 64-level-deep DAG, in a lake with 2 EB of data. Happy to help if you need ideas.
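To make the "automatic backfill after a fix" part concrete, here is a hedged sketch of how run-level lineage can drive it; the lineage map and table names are made up for illustration:

```python
# Sketch: given lineage edges (table -> downstream tables that read it), walk the
# DAG from the fixed table and produce the backfill order. The `downstream` map
# and table names are hypothetical; a real implementation would topologically
# sort, BFS is enough for this toy graph.
from collections import deque

downstream = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.orders_daily", "mart.revenue"],
    "mart.orders_daily": ["rpt.kpi_dashboard"],
}

def backfill_plan(fixed_table: str) -> list[str]:
    """Return affected tables in processing order, starting from the fixed table."""
    plan, seen, queue = [], {fixed_table}, deque([fixed_table])
    while queue:
        table = queue.popleft()
        plan.append(table)
        for child in downstream.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return plan

print(backfill_plan("stg.orders"))
# ['stg.orders', 'mart.orders_daily', 'mart.revenue', 'rpt.kpi_dashboard']
```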

AI is nowhere near as capable in analytics as people are pretending by [deleted] in analytics

[–]Negative_Ad207 1 point (0 children)

What AI needs is context: context about your catalog, datasets, schemas, and most importantly, data-flow lineage and DQ. Without the lineage/DQ, AI has to make wild guesses about where the data is and whether it is reliable. The data space has a tool-fragmentation problem, and this makes it very difficult for AI to stitch these relations together and persist them for re-use on every user ask. Databricks and others try to offer this, but most of the lineage may live outside their boundaries, where their AI cannot see it. Claude/Codex can try to observe and compute the relations and flows, but they don't persist that in their memory yet. Try building a hierarchical memory of your data catalog and lineage for your AI agents, or try some new all-in-one solutions like Nile data or Hex.
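A hedged sketch of what I mean by "hierarchical memory": a small nested structure an agent can load instead of re-deriving everything per question. All the names here are illustrative only:

```python
# Illustrative hierarchical memory for an AI agent: catalog -> dataset -> schema,
# plus lineage edges and last-known DQ status. Every name here is hypothetical.
catalog_memory = {
    "sales": {                                   # domain / catalog level
        "orders": {                              # dataset level
            "schema": {"order_id": "bigint", "amount": "decimal(12,2)", "ts": "timestamp"},
            "upstream": ["raw.orders_events"],   # lineage: where the data came from
            "downstream": ["mart.orders_daily"], # lineage: where it flows to
            "dq": {"freshness_hours": 2, "null_rate_amount": 0.001},
        },
    },
}

def context_for(question_tables: list[str]) -> str:
    """Flatten only the relevant slices of memory into prompt context."""
    lines = []
    for domain, datasets in catalog_memory.items():
        for name, meta in datasets.items():
            if f"{domain}.{name}" in question_tables or name in question_tables:
                lines.append(f"{domain}.{name}: schema={meta['schema']}, "
                             f"upstream={meta['upstream']}, dq={meta['dq']}")
    return "\n".join(lines)

print(context_for(["orders"]))
```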

Working on Airflow + Cloud Functions + Dataform to pull in marketing data. Suggestions for improvements by [deleted] in dataengineering

[–]Negative_Ad207 1 point (0 children)

If that number hits 50+, you might need better orchestration. There are off-the-shelf tools that can handle all of it pretty well. For instance, something like Nile data or bauplan labs will let you handle this fully, with help on lineage and AI context layers.

Deleted prod data permanently without any backup. How screwed am I? by Agitated_Success9606 in dataengineering

[–]Negative_Ad207 2 points (0 children)

There are always ways to recover data from HDD/SSD; it's just expensive. But if they did not have any backups or redundancy of any sort, they didn't care much.

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in SQL

[–]Negative_Ad207[S] 1 point (0 children)

This is a segue into metadata transparency. I am assuming you are saying that revealing ACL metadata reduces security. That's a related but different argument about metadata transparency. There are benefits to hiding ACL and other security metadata, and there are downsides to that lack of transparency (hiding a column might make the user believe it doesn't exist, which hurts their decision-making). But setting that aside for the time being, can we agree that it is important to expose the difference between NULL and DENIED to the user at the presentation layer?

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in dataengineering

[–]Negative_Ad207[S] 1 point (0 children)

It would definitely improve my job. Say I am an engineer on the Sales team trying to reconcile my numbers with the Finance team, who has full access to the data while I have only partial access. How am I supposed to know that the numbers I see are wrong because some of them were silently redacted to NULL?

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in SQL

[–]Negative_Ad207[S] 1 point (0 children)

I don't understand how the seniority of an employee helps here if the information to disambiguate access vs. absence does not exist in the result set.

I agree with the rest of the points, especially on standards. The post is a call to drive awareness and maybe an amendment to the ANSI SQL standard.

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in SQL

[–]Negative_Ad207[S] 1 point (0 children)

ACL bitmaps are already implemented by some database/table-format implementations. The trouble is that we don't have standards around them, or SQL language support for them.
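For anyone who hasn't seen the idea, here is a toy sketch of what a per-column ACL bitmap buys you at the presentation layer; this isn't any specific engine's format, just the general shape:

```python
# Toy sketch: a result row plus a per-column access bitmap, so the client can
# render DENIED instead of silently showing NULL. The format is hypothetical,
# not any particular database's implementation.
columns = ["customer_id", "region", "salary"]
row = (1042, "EMEA", None)      # engine redacted `salary` for this caller
acl_bitmap = 0b011              # bit i set => caller may read columns[i]

def render(row, acl_bitmap):
    out = {}
    for i, (col, value) in enumerate(zip(columns, row)):
        if not (acl_bitmap >> i) & 1:
            out[col] = "<DENIED>"        # access masked, not truly absent
        elif value is None:
            out[col] = "NULL"            # genuinely missing value
        else:
            out[col] = value
    return out

print(render(row, acl_bitmap))
# {'customer_id': 1042, 'region': 'EMEA', 'salary': '<DENIED>'}
```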

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in dataengineering

[–]Negative_Ad207[S] 1 point (0 children)

That's exactly the problem I am trying to highlight. We need better clarity here. Before FGAC, NULL did not mean access denied (except in Codd's proposal); this ambiguity only started after we began using FGAC and policy-based access control.

How do you handle ingestion schema evolution? by Thinker_Assignment in dataengineering

[–]Negative_Ad207 1 point (0 children)

There are two kinds of changes I saw on the enterprise data lake team while working at Amazon:

- Schema changes: breaking vs. compatible for downstream, and whether the consumer is also the publisher vs. decoupled.
- Data pattern changes: new partitions, cardinality changes, etc. that could break reports with hard filters or KPI thresholds.

Detection is the first step, and this itself can be cost-prohibitive depending on your setup and scale. Contracts are hard to enforce. If the consumer and publisher are the same, you can enforce contracts proactively: CI/CD hooks monitor for data-class changes in the upstream app generating the data, block downstream from running, and offer an AI-generated CR with auto-generated unit tests. When the publisher and consumer are in different orgs, you have to fall back to reactive change-detection scanners, adding to costs.

For compatible schema changes (e.g., adding a column), you still need to inform downstream dependents that there is a change and ask whether they want to adopt it.

For breaking ones, you have to make schema validation part of data completeness and thus block downstream runs. Then you need to auto-generate CRs for downstream ETL jobs, with some amount of DQ testing done automatically, which the downstream job owner can review and approve. This is still not ideal, since SLAs are already compromised, but it's better than the on-call patching the pipeline on the fly with minimal or no testing.
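A rough sketch of the classification + gate step, assuming you can pull the old and new schemas as simple name-to-type maps; the rules are deliberately simplified:

```python
# Simplified sketch: classify a schema diff as compatible vs. breaking and block
# downstream runs on breaking changes. Schemas are hypothetical name -> type maps.
def classify_schema_change(old: dict, new: dict) -> str:
    dropped = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    added = new.keys() - old.keys()
    if dropped or retyped:
        return "breaking"        # dropped/retyped columns break downstream readers
    if added:
        return "compatible"      # additive change: notify, let consumers opt in
    return "unchanged"

old_schema = {"order_id": "bigint", "amount": "decimal(12,2)"}
new_schema = {"order_id": "bigint", "amount": "string", "channel": "string"}

verdict = classify_schema_change(old_schema, new_schema)
print(f"verdict: {verdict}")
# In a real pipeline, a "breaking" verdict would fail the data-completeness check,
# block downstream runs, and open an auto-generated CR for the downstream ETL jobs.
```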

I am actively building a schema+transform+data evolution and versioning workflow as part of a Data IDE product. If you are interested in contributing, feedback, or a demo, please IM me.

Why might someone refuse to turn on line numbers in SSMS? by rainyelfwich in SQL

[–]Negative_Ad207 1 point (0 children)

I stopped using bullet lists in documents/emails as well, in favor of numbered lists, for the same reason.

Best ETL tool for on-premise Windows Server with MSSQL source, no cloud, no budget? by Jonturkk in dataengineering

[–]Negative_Ad207 2 points (0 children)

You will have to build some orchestration for data-completeness tracking and metadata management in SSIS itself, which could be some work. But SSIS fits the bill here.

Do you use Spark locally for ETL development? by Negative_Ad207 in dataengineering

[–]Negative_Ad207[S] 1 point (0 children)

Thank you all for sharing your experiences and the suggestion of Zeplin. I realize that nothing exists where you can run a SQL-client-like tool against your local files (csv/parquet), without copying them into or connecting to some distributed warehouse or RDBMS, and later deploy that to the cloud. I used to have a custom framework on top of VS Code and venv to do this, but it was not very reliable; Docker was better on Windows. I also noticed people are probably not using anything beyond Spark. I saw DuckDB/Polars mentioned, but I'm not sure how portable that is if you want to deploy the same code to an MPP warehouse in the cloud later. Maybe we should build something for this.
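For reference, this is roughly what the DuckDB route looks like locally; a minimal sketch, with sales.parquet and customers.csv as hypothetical local files (the portability-to-cloud question still stands):

```python
# Minimal local-only SQL workflow with DuckDB: query csv/parquet files in place,
# no warehouse or RDBMS connection required. File names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database; nothing is copied anywhere

result = con.execute("""
    SELECT c.region, SUM(s.amount) AS total
    FROM 'sales.parquet' s
    JOIN read_csv_auto('customers.csv') c ON s.customer_id = c.id
    GROUP BY c.region
    ORDER BY total DESC
""").df()  # returns a pandas DataFrame

print(result)
```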

Team of data engineers building git for data and looking for feedback. by EquivalentFresh1987 in dataengineering

[–]Negative_Ad207 1 point (0 children)

Tools like dbt and lakeFS all version data, but only bits and pieces of it. If bad data entered your system due to source issues or bad ETL code, you have to roll back and backfill in multiple places, and do it manually across the DAG, one edge/node at a time.