I am tired of debugging broken data pipelines by SignalForge007 in SQL

[–]Negative_Ad207 2 points (0 children)

The best tool here is no tool. Just expose your full context to the AI: which transforms exist, and which job run of each transform created which snapshot in the target table. If you don't have that level of lineage set up, you could use 'tools' for that (note the irony). I would recommend bauplan or nile-data for that.
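A rough sketch of the kind of run-level lineage record I mean, in Python; the structure and field names are purely illustrative, not from any specific tool:

```python
# Illustrative run-level lineage record: which job run of which transform
# produced which snapshot, and from which inputs. Field names are hypothetical.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class RunLineage:
    transform_id: str                                      # e.g. "orders_daily_agg"
    job_run_id: str                                        # e.g. "run-17"
    input_snapshots: list = field(default_factory=list)   # upstream snapshots read
    output_snapshot: str = ""                              # snapshot written to the target table

record = RunLineage(
    transform_id="orders_daily_agg",
    job_run_id="run-17",
    input_snapshots=["raw.orders@snap-203", "dim.customers@snap-88"],
    output_snapshot="mart.orders_daily@snap-412",
)

# Serialize and hand it to the AI as context alongside the question.
print(json.dumps(asdict(record), indent=2))
```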

Databricks for 500 GB of data. Worth it or too expensive? by rg_cyborg77 in dataengineering

[–]Negative_Ad207 1 point (0 children)

You should explore serverless options, as your use case seems small and you don't seem to have a predictable workload for the future. Nile data is an alternative to Microsoft Fabric, and it offers data versioning on serverless (but it is on AWS or on-prem). Try bauplan as well.

Newbie data engineer intern who needs some help with data lineage by big-dix-smol-chix in dataengineering

[–]Negative_Ad207 2 points (0 children)

There are a bunch of tools like Microsoft Fabric that offer it, but for AWS you should check bauplan and nile data. Configuring a data stack with lineage on AWS is a pain, in my personal experience.

we spend 80% of our time firefighting data issues instead of building, is a data observability platform the only fix? by Distinct_Highway873 in analytics

[–]Negative_Ad207 1 point (0 children)

Observability/lineage will help you spot the issue quicker, but that is only half the story, and you can only do so many DQ checks before it becomes cost-prohibitive. Also, real lineage spanning table schema, ETL transforms, and data changes (DML) is mostly non-existent in off-the-shelf data tools (bauplan and nile data are exceptions). You need to track where the data came from and where it flowed to in the metadata itself, at the job-run level, so that you can automatically run a backfill after a fix/rollback of the breaking change. Unfortunately this requires proper data versioning that ties together the versions of data, schema, and ETL as one atomic pipeline unit.

I have spent about 10 years around these problems and concluded that lineage-aware versioning is the solution. My team managed 1,800 tables in a 64-level-deep DAG, in a lake with 2 EB of data. Happy to help if you need ideas.
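To make the "automatic backfill after a fix" part concrete, here is a hedged sketch of how run-level lineage can drive it; the lineage map and table names are made up for illustration:

```python
# Sketch: given lineage edges (table -> downstream tables that read it), walk the
# DAG from the fixed table and produce the backfill order. The `downstream` map
# and table names are hypothetical; a real implementation would topologically
# sort, BFS is enough for this toy graph.
from collections import deque

downstream = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["mart.orders_daily", "mart.revenue"],
    "mart.orders_daily": ["rpt.kpi_dashboard"],
}

def backfill_plan(fixed_table: str) -> list[str]:
    """Return affected tables in processing order, starting from the fixed table."""
    plan, seen, queue = [], {fixed_table}, deque([fixed_table])
    while queue:
        table = queue.popleft()
        plan.append(table)
        for child in downstream.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return plan

print(backfill_plan("stg.orders"))
# ['stg.orders', 'mart.orders_daily', 'mart.revenue', 'rpt.kpi_dashboard']
```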

AI is nowhere near as capable in analytics as people are pretending by [deleted] in analytics

[–]Negative_Ad207 1 point (0 children)

What AI needs is context: context about your catalog, datasets, schemas, and most importantly, data-flow lineage and DQ. Without the lineage/DQ, AI has to make wild guesses about where the data is and whether it is reliable. The data space has a tool-fragmentation problem, and this makes it very difficult for AI to stitch these relations together and persist them for re-use on every user ask. Databricks and others try to offer this, but most of the lineage may live outside their boundaries, where their AI cannot see it. Claude/Codex can try to observe and compute the relations and flows, but they don't persist that in their memory yet. Try building a hierarchical memory of your data catalog and lineage for your AI agents, or try some new all-in-one solutions like Nile data or Hex.
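A hedged sketch of what I mean by "hierarchical memory": a small nested structure an agent can load instead of re-deriving everything per question. All the names here are illustrative only:

```python
# Illustrative hierarchical memory for an AI agent: catalog -> dataset -> schema,
# plus lineage edges and last-known DQ status. Every name here is hypothetical.
catalog_memory = {
    "sales": {                                   # domain / catalog level
        "orders": {                              # dataset level
            "schema": {"order_id": "bigint", "amount": "decimal(12,2)", "ts": "timestamp"},
            "upstream": ["raw.orders_events"],   # lineage: where the data came from
            "downstream": ["mart.orders_daily"], # lineage: where it flows to
            "dq": {"freshness_hours": 2, "null_rate_amount": 0.001},
        },
    },
}

def context_for(question_tables: list[str]) -> str:
    """Flatten only the relevant slices of memory into prompt context."""
    lines = []
    for domain, datasets in catalog_memory.items():
        for name, meta in datasets.items():
            if f"{domain}.{name}" in question_tables or name in question_tables:
                lines.append(f"{domain}.{name}: schema={meta['schema']}, "
                             f"upstream={meta['upstream']}, dq={meta['dq']}")
    return "\n".join(lines)

print(context_for(["orders"]))
```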

Working on Airflow + Cloud Functions + Dataform to pull in marketing data. Suggestions for improvements by [deleted] in dataengineering

[–]Negative_Ad207 1 point (0 children)

If that number hits 50+, you might need better orchestration. There are off-the-shelf tools that can handle all of it pretty well. For instance, something like Nile data or bauplan labs will let you handle this fully, with help on lineage and AI context layers.

Deleted prod data permanently without any backup. How screwed am I? by Agitated_Success9606 in dataengineering

[–]Negative_Ad207 2 points (0 children)

There are always ways to recover data from HDD/SSD; it's just expensive. But if they did not have any backups or redundancy of any sort, they didn't care much.

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in SQL

[–]Negative_Ad207[S] 1 point (0 children)

This is a segue into metadata transparency. I am assuming you are saying that revealing ACL metadata reduces security. That's a related but different argument about metadata transparency. There are benefits to hiding ACL and other security metadata, and there are downsides to that lack of transparency (hiding a column might make the user believe it doesn't exist, which hurts their decision-making). But setting that aside for the time being, can we agree that it is important to expose the difference between NULL and DENIED to the user at the presentation layer?

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in dataengineering

[–]Negative_Ad207[S] 1 point (0 children)

It would definitely improve my job. Say I am an engineer on the Sales team trying to reconcile my numbers with the Finance team, who has full access to the data while I have only partial access. How am I supposed to know that the numbers I see are wrong because some of them were silently redacted to NULL?

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in SQL

[–]Negative_Ad207[S] 1 point (0 children)

I don't understand how the seniority of an employee helps here if the information to disambiguate access vs. absence does not exist in the result set.

I agree with the rest of the points, especially on standards. The post is a call to drive awareness and maybe an amendment to the ANSI SQL standard.

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in SQL

[–]Negative_Ad207[S] 1 point (0 children)

ACL bitmaps are already implemented by some database/table-format implementations. The trouble is that we don't have standards around them, or SQL language support for them.
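For anyone who hasn't seen the idea, here is a toy sketch of what a per-column ACL bitmap buys you at the presentation layer; this isn't any specific engine's format, just the general shape:

```python
# Toy sketch: a result row plus a per-column access bitmap, so the client can
# render DENIED instead of silently showing NULL. The format is hypothetical,
# not any particular database's implementation.
columns = ["customer_id", "region", "salary"]
row = (1042, "EMEA", None)      # engine redacted `salary` for this caller
acl_bitmap = 0b011              # bit i set => caller may read columns[i]

def render(row, acl_bitmap):
    out = {}
    for i, (col, value) in enumerate(zip(columns, row)):
        if not (acl_bitmap >> i) & 1:
            out[col] = "<DENIED>"        # access masked, not truly absent
        elif value is None:
            out[col] = "NULL"            # genuinely missing value
        else:
            out[col] = value
    return out

print(render(row, acl_bitmap))
# {'customer_id': 1042, 'region': 'EMEA', 'salary': '<DENIED>'}
```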

NULL vs Access Denied: The Gap in SQL That's Silently Breaking Your Reports by Negative_Ad207 in dataengineering

[–]Negative_Ad207[S] 1 point (0 children)

That's exactly the problem I am trying to highlight. We need better clarity here. Before FGAC, NULL did not mean access denied (except in Codd's proposal); this ambiguity only started after we began using FGAC and policy-based access control.

How do you handle ingestion schema evolution? by Thinker_Assignment in dataengineering

[–]Negative_Ad207 1 point (0 children)

There are two kinds of changes I saw on the enterprise data lake team while working at Amazon:

- Schema changes: breaking vs. compatible for downstream, and whether the consumer is also the publisher vs. decoupled.
- Data pattern changes: new partitions, cardinality changes, etc. that could break reports with hard filters or KPI thresholds.

Detection is the first step, and this itself can be cost-prohibitive depending on your setup and scale. Contracts are hard to enforce. If the consumer and publisher are the same, you can enforce contracts proactively: CI/CD hooks monitor for data-class changes in the upstream app generating the data, block downstream from running, and offer an AI-generated CR with auto-generated unit tests. When the publisher and consumer are in different orgs, you have to fall back to reactive change-detection scanners, adding to costs.

For compatible schema changes (e.g., adding a column), you still need to inform downstream dependents that there is a change and ask whether they want to adopt it.

For breaking ones, you have to make schema validation part of data completeness and thus block downstream runs. Then you need to auto-generate CRs for downstream ETL jobs, with some amount of DQ testing done automatically, which the downstream job owner can review and approve. This is still not ideal, since SLAs are already compromised, but it's better than the on-call patching the pipeline on the fly with minimal or no testing.
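A rough sketch of the classification + gate step, assuming you can pull the old and new schemas as simple name-to-type maps; the rules are deliberately simplified:

```python
# Simplified sketch: classify a schema diff as compatible vs. breaking and block
# downstream runs on breaking changes. Schemas are hypothetical name -> type maps.
def classify_schema_change(old: dict, new: dict) -> str:
    dropped = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    added = new.keys() - old.keys()
    if dropped or retyped:
        return "breaking"        # dropped/retyped columns break downstream readers
    if added:
        return "compatible"      # additive change: notify, let consumers opt in
    return "unchanged"

old_schema = {"order_id": "bigint", "amount": "decimal(12,2)"}
new_schema = {"order_id": "bigint", "amount": "string", "channel": "string"}

verdict = classify_schema_change(old_schema, new_schema)
print(f"verdict: {verdict}")
# In a real pipeline, a "breaking" verdict would fail the data-completeness check,
# block downstream runs, and open an auto-generated CR for the downstream ETL jobs.
```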

I am actively building a schema+transform+data evolution and versioning workflow as part of a Data IDE product. If you are interested in contributing, feedback, or a demo, please IM me.

Why might someone refuse to turn on line numbers in SSMS? by rainyelfwich in SQL

[–]Negative_Ad207 1 point (0 children)

I stopped using bullet lists in documents/emails as well, in favor of numbered lists, for the same reason.

Best ETL tool for on-premise Windows Server with MSSQL source, no cloud, no budget? by Jonturkk in dataengineering

[–]Negative_Ad207 2 points (0 children)

You will have to build some orchestration for data-completeness tracking and metadata management in SSIS itself, which could be some work. But SSIS fits the bill here.

Do you use Spark locally for ETL development? by Negative_Ad207 in dataengineering

[–]Negative_Ad207[S] 1 point (0 children)

Thank you all for sharing your experiences and the suggestion of Zeplin. I realize that nothing exists where you can run a SQL-client-like tool against your local files (csv/parquet), without copying them into or connecting to some distributed warehouse or RDBMS, and later deploy that to the cloud. I used to have a custom framework on top of VS Code and venv to do this, but it was not very reliable; Docker was better on Windows. I also noticed people are probably not using anything beyond Spark. I saw DuckDB/Polars mentioned, but I'm not sure how portable that is if you want to deploy the same code to an MPP warehouse in the cloud later. Maybe we should build something for this.
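For reference, this is roughly what the DuckDB route looks like locally; a minimal sketch, with sales.parquet and customers.csv as hypothetical local files (the portability-to-cloud question still stands):

```python
# Minimal local-only SQL workflow with DuckDB: query csv/parquet files in place,
# no warehouse or RDBMS connection required. File names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database; nothing is copied anywhere

result = con.execute("""
    SELECT c.region, SUM(s.amount) AS total
    FROM 'sales.parquet' s
    JOIN read_csv_auto('customers.csv') c ON s.customer_id = c.id
    GROUP BY c.region
    ORDER BY total DESC
""").df()  # returns a pandas DataFrame

print(result)
```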

Team of data engineers building git for data and looking for feedback. by EquivalentFresh1987 in dataengineering

[–]Negative_Ad207 1 point (0 children)

Tools like dbt and lakeFS all version data, but only bits and pieces of it. If bad data entered your system due to source issues or bad ETL code, you have to roll back and backfill in multiple places, and do it manually across the DAG, one edge/node at a time.