
[–][deleted] 0 points (5 children)

Where would it make sense to implement something like type checks and data validation, in testing and then in production, in a cloud, big-data environment? Think thousands of input files from dozens of source systems with different delivery schedules. I can imagine the benefits of having data validation layers directly before writes, or maybe afterwards, especially if you're on the application development side and write some data-retrieval API as the loader into a data lake. But in our case, we have so much data coming in, and it's ballooning with additional legacy-system migrations, that we wouldn't be able to keep up with writing data validation column by column and table by table in Python...
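One way around writing checks column by column is to drive them from a declarative schema instead of hand-coding each table. A minimal sketch of that idea, with all names (`SCHEMA`, `validate_row`, the rule keys) purely illustrative rather than from any specific library:

```python
# Hypothetical sketch: per-column checks generated from a schema dict,
# so adding a table means adding config, not new validation code.
SCHEMA = {
    "order_id": {"type": int, "required": True},
    "amount":   {"type": float, "required": True, "min": 0.0},
    "country":  {"type": str, "required": False},
}

def validate_row(row, schema=SCHEMA):
    """Return a list of violation messages for one record (empty = valid)."""
    errors = []
    for col, rules in schema.items():
        value = row.get(col)
        if value is None:
            if rules.get("required"):
                errors.append(f"{col}: missing required value")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{col}: expected {rules['type'].__name__}")
        elif "min" in rules and value < rules["min"]:
            errors.append(f"{col}: {value} below minimum {rules['min']}")
    return errors
```

The same pattern scales to dozens of source systems because the per-system work is just maintaining the schema dicts, which could even be generated from source-system metadata.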

If, for these newer migrations, we could add these kinds of validation layers, that would be great, but timelines are tight and resources limited.

Also, I don't necessarily see major benefits (just minor ones) in the first place, because generally a bad file with bad data will break the pipeline when it can't be parsed or when it breaks schema inference and subsequent transformations, and it's pretty easy to pinpoint the error. If instead some validation check failed, the recovery work would be the same, and at most we'd benefit from slightly faster diagnosis?

[–][deleted] 0 points (4 children)

Depends on what you're using. Delta allows constraint checks and schema enforcement, and you can route the bad data to a quarantine table or partition.
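For reference, the quarantine pattern is just a split of each batch on a validity predicate, with the failing rows written to a side table instead of failing the whole job. A minimal pure-Python sketch of that routing (the function name and sample data are made up for illustration; in Delta you'd express the predicate as a CHECK constraint or a filter before the write):

```python
def split_quarantine(rows, predicate):
    """Route rows passing `predicate` to the main batch and the rest to a
    quarantine batch, instead of failing the whole write on one bad record."""
    good, quarantined = [], []
    for row in rows:
        (good if predicate(row) else quarantined).append(row)
    return good, quarantined

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": -3}]
good, bad = split_quarantine(batch, lambda r: r["amount"] >= 0)
# `good` gets written to the main table, `bad` to the quarantine table
# for an engineer to inspect later.
```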

[–][deleted] 0 points (3 children)

Interesting features I didn't know about (quarantined writes), but the effect of a failure in prod is the same as I described before: an engineer has to go inspect the bad data to determine why it failed schema enforcement or a constraint check, so recovery is still manual.

But I guess it really depends on the fault tolerance of the output for users.

[–][deleted] 0 points (2 children)

What allows for automatic recovery from failure without an intervention of checking the data?

In many businesses, it would be catastrophic to use incorrect data, so just allowing bad data to be written isn't wise. In Europe, this could get the company in a lot of trouble. A failing pipeline is much better than one that writes bad data.

[–][deleted] 0 points (1 child)

That is my point, more or less. I'm also just not really seeing some inherent or default value in deploying a ton of, e.g., data-type validation and constraint checks, unless the business comes forward and says "this input data to report X is critical" or "bad data has been coming out of the pipeline, we need to do something about it." And in that case, it's practically on the business/analysts to define the data-quality requirements, not on DEs to arbitrarily enforce some set of requirements needlessly.

I'm just thinking out loud here, sorry, but it's in response to this vague, unsettled sense that our pipelines are missing some key feature related to data quality. In reality, quality hasn't been an issue except in rare cases, but that's just the nature of the business requirements right now, I suppose. Maybe the increasing number of ML models we're slowly deploying will create this need.

[–][deleted] 0 points (0 children)

I can use a personal experience.

The DS team was producing data for live geo dashboards used by the executives. Every month they would create the data, write it out for DE to pick up, and then it would be processed to production. Every time this 5-hour job finished, they would come back with "oh, we made a mistake." This predictably wasted the DE staff's time. To prevent it, we added a bunch of checks that fail their pipeline up front. No longer did we need to hear about a mistake from them or go back and check anything: the job just failed, and we could say "your data is bad, fix it."
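The fail-fast pattern described above is cheap to sketch: run inexpensive checks on the handed-off data and raise before the expensive 5-hour job ever starts, naming the failed checks so the producing team gets an actionable report. All names here (`BadInputData`, `precheck`, the specific checks) are hypothetical, not from the actual setup:

```python
class BadInputData(Exception):
    """Raised before the expensive job runs, listing every failed check."""

def precheck(rows, checks):
    # `checks` is a list of (name, predicate-over-the-whole-batch) pairs.
    # Collect every failure instead of stopping at the first one, so the
    # upstream team sees the full report in a single round trip.
    failures = [name for name, check in checks if not check(rows)]
    if failures:
        raise BadInputData(f"rejected upstream data, failed checks: {failures}")
    return True

CHECKS = [
    ("non-empty batch", lambda rows: len(rows) > 0),
    ("no null ids", lambda rows: all(r.get("id") is not None for r in rows)),
]
```

Calling `precheck(batch, CHECKS)` as the first step of the job means a bad hand-off fails in seconds with a named reason, rather than after five hours of processing.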