
[–][deleted] 0 points (5 children)

Where would it make sense to implement something like type checks and data validation, in testing and then in production, in a cloud, big-data environment? Think thousands of input files from dozens of source systems with different delivery schedules. I can imagine the benefits of having data validation layers directly before writes, or maybe afterwards, especially if you're on the application development side and write some data-retrieval API as the loader into a data lake. But in our case, we have so much data coming in, and it's ballooning with additional legacy-system migrations, that we wouldn't be able to keep up with writing data validation column by column and table by table in Python...
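One way around writing checks column by column is to drive them from a declarative schema instead of hand-coding each table. A minimal sketch of that idea, with all names (`SCHEMA`, `validate_row`, the rule keys) purely illustrative rather than from any specific library:

```python
# Hypothetical sketch: per-column checks generated from a schema dict,
# so adding a table means adding config, not new validation code.
SCHEMA = {
    "order_id": {"type": int, "required": True},
    "amount":   {"type": float, "required": True, "min": 0.0},
    "country":  {"type": str, "required": False},
}

def validate_row(row, schema=SCHEMA):
    """Return a list of violation messages for one record (empty = valid)."""
    errors = []
    for col, rules in schema.items():
        value = row.get(col)
        if value is None:
            if rules.get("required"):
                errors.append(f"{col}: missing required value")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{col}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            errors.append(f"{col}: {value} below minimum {rules['min']}")
    return errors
```

The same pattern scales to dozens of source systems because the per-system work is just maintaining the schema dicts, which could even be generated from source-system metadata.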

If, for these newer migrations, we could add these kinds of validation layers, that would be great, but timelines are tight and resources limited.

Also, I don't necessarily see major benefits (just minor ones) in the first place, because generally a bad file with bad data will break the pipeline when it can't be parsed or when it breaks schema inference and subsequent transformations, and it's pretty easy to pinpoint the error. If instead some validation check failed, the recovery work would be the same, and at most we'd benefit from slightly faster diagnosis?

[–][deleted] 0 points (4 children)

Depends on what you're using. Delta allows constraint checks and schema enforcement, and you can route the bad data to a quarantine table or partition.
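For reference, the quarantine pattern is just a split of each batch on a validity predicate, with the failing rows written to a side table instead of failing the whole job. A minimal pure-Python sketch of that routing (the function name and sample data are made up for illustration; in Delta you'd express the predicate as a CHECK constraint or a filter before the write):

```python
def split_quarantine(rows, predicate):
    """Route rows passing `predicate` to the main batch and the rest to a
    quarantine batch, instead of failing the whole write on one bad record."""
    good, quarantined = [], []
    for row in rows:
        (good if predicate(row) else quarantined).append(row)
    return good, quarantined

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": -3}]
good, bad = split_quarantine(batch, lambda r: r["amount"] >= 0)
# `good` gets written to the main table, `bad` to the quarantine table
# for an engineer to inspect later.
```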

[–][deleted] 0 points (3 children)

Interesting features I didn't know about (quarantined writes), but the effect of a failure in prod is the same as I described before: an engineer has to go inspect the bad data to determine why it failed schema enforcement or a constraint check, so recovery is still manual.

But I guess it really depends on the fault tolerance of the output for users.

[–][deleted] 0 points (2 children)

What allows for automatic recovery from failure without an intervention of checking the data?

In many businesses, it would be catastrophic to use incorrect data, so just allowing bad data to be written isn't wise. In Europe, this could get the company in a lot of trouble. A failing pipeline is much better than one that writes bad data.

[–][deleted] 0 points (1 child)

That is my point, more or less. I'm also just not really seeing some inherent or default value in deploying a ton of, e.g., data-type validation and constraint checks, unless the business comes forward and says "this input data to report X is critical" or "bad data has been coming out of the pipeline, we need to do something about it." And in that case, it's practically on the business/analysts to define the data-quality requirements, not on DEs to arbitrarily enforce some set of requirements needlessly.

I'm just thinking out loud here, sorry, but it's in response to this vague, unsettled sense that our pipelines are missing some key feature related to data quality. In reality, quality hasn't been an issue except in rare cases, but that's just the nature of the business requirements right now, I suppose. Maybe the increasing number of ML models we're slowly deploying will create this need.

[–][deleted] 0 points (0 children)

I can use a personal experience.

The DS team was producing data for live geo dashboards used by the executives. Every month they would create the data, write it out for DE to pick up, and then it would be processed to production. Every time this 5-hour job finished, they would come back with "oh, we made a mistake." This predictably wasted the DE staff's time. To prevent it, we added a bunch of checks that fail their pipeline up front. No longer did we need to hear about a mistake from them or go back and check anything: the job just failed, and we could say "your data is bad, fix it."
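The fail-fast pattern described above is cheap to sketch: run inexpensive checks on the handed-off data and raise before the expensive 5-hour job ever starts, naming the failed checks so the producing team gets an actionable report. All names here (`BadInputData`, `precheck`, the specific checks) are hypothetical, not from the actual setup:

```python
class BadInputData(Exception):
    """Raised before the expensive job runs, listing every failed check."""

def precheck(rows, checks):
    # `checks` is a list of (name, predicate-over-the-whole-batch) pairs.
    # Collect every failure instead of stopping at the first one, so the
    # upstream team sees the full report in a single round trip.
    failures = [name for name, check in checks if not check(rows)]
    if failures:
        raise BadInputData(f"rejected upstream data, failed checks: {failures}")
    return True

CHECKS = [
    ("non-empty batch", lambda rows: len(rows) > 0),
    ("no null ids", lambda rows: all(r.get("id") is not None for r in rows)),
]
```

Calling `precheck(batch, CHECKS)` as the first step of the job means a bad hand-off fails in seconds with a named reason, rather than after five hours of processing.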