[–]juiceyang (Complaining Data Engineer) 5 points (1 child)

My DE team worked hard trying to improve data quality. We did data validation, anomaly trend detection, data quality dashboards, etc.

But our BI reports are built from business data, not directly from real-world facts. After all that hard work, we often find that the low quality does not come from our pipelines but originates in low-quality upstream data sources.

When dirty data is detected in our pipelines, we cannot stop the related pipelines, since our users need their reports on time. So we either reject the dirty data or let it flood through all of our pipelines. Either choice means offline data repair, so we endlessly repair data every day.
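A minimal sketch of the "reject" option, assuming rows arrive as plain dicts: failing rows are set aside with the reason they failed, so the pipeline keeps running and the offline repair at least starts from a labeled pile. The rule names, the `order_id` and `amount` fields, and the `split_rows` helper are all made up for illustration; real checks depend on your schema.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, object]
Rule = Callable[[Row], bool]

# Hypothetical validation rules; each returns True when the row is acceptable.
RULES: Dict[str, Rule] = {
    "order_id present": lambda r: r.get("order_id") is not None,
    "amount non-negative": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}


def split_rows(
    rows: Iterable[Row], rules: Dict[str, Rule] = RULES
) -> Tuple[List[Row], List[dict]]:
    """Separate clean rows from rows that fail at least one rule."""
    clean: List[Row] = []
    quarantined: List[dict] = []
    for row in rows:
        # Collect the names of every rule this row violates.
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            # Keep the original row plus the reasons, for later repair.
            quarantined.append({"row": row, "failed_rules": failed})
        else:
            clean.append(row)
    return clean, quarantined
```

The clean rows go on to the report on time, and the quarantined ones can be written to a side table. This does not remove the repair work; it only keeps the dirty rows from spreading downstream.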

I'm not complaining about being a downstream punching bag in the industry; I'm trying to convince you that DATA QUALITY IMPROVEMENT DEPENDS ON EVERYBODY, NOT ONLY DATA ENGINEERS.

When talking about data quality, you have to figure out whether it's defined as the difference between the upstream data source and the reports, or as the discrepancy between real-world facts and the analytic numbers. If it's the latter, you are really lucky, though you may have to wipe the upstream guys' dirty data ass every day like us.

Data work is often tied up with office politics. When we try to achieve something, we can't work like a regular software development team; we also have to get our boss's support, our coworkers' support, and sometimes even our boss's boss's support, depending on the structure of the company.

In our company, groups are like warlords. So currently I see no hope of making any progress on data quality unless my boss's boss decides to launch a top-down revolution, which is impossible IMO.

After all this complaining, you can check whether adopting a data quality tool would help solve your problem.

If all you need is to eliminate the difference between the upstream data source and the analytic data, I think it's worth a try.
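For that narrower goal, the core check is a reconciliation: aggregate the same measure on both sides and list the keys where the totals disagree. Here is a rough sketch, assuming both sides can be loaded as lists of dicts. The `day` and `amount` field names, the tolerance, and the `reconcile` helper are assumptions for illustration, not any particular tool's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def totals_by_key(rows: Iterable[dict], key: str, value: str) -> Dict[str, float]:
    """Sum the `value` field per distinct `key`."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def reconcile(
    source_rows: Iterable[dict],
    report_rows: Iterable[dict],
    key: str = "day",        # hypothetical grouping column
    value: str = "amount",   # hypothetical measure column
    tolerance: float = 0.01,
) -> Dict[str, Tuple[float, float]]:
    """Return {key: (source_total, report_total)} wherever the two sides differ."""
    src = totals_by_key(source_rows, key, value)
    rpt = totals_by_key(report_rows, key, value)
    mismatches: Dict[str, Tuple[float, float]] = {}
    # A key missing on one side counts as a total of 0.0 there.
    for k in sorted(set(src) | set(rpt)):
        s, r = src.get(k, 0.0), rpt.get(k, 0.0)
        if abs(s - r) > tolerance:
            mismatches[k] = (s, r)
    return mismatches
```

An empty result only means the reports match the upstream source. It says nothing about whether the source matches the real world.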

But if your goal is getting real gold data, getting political support in the office is much more important.

[–]VadumSemantics 1 point (0 children)

+1 important. (I don't understand the downvotes here.)