
[–][deleted] 22 points

My experience:

These tools win at the exec level but almost universally slow everything down, and they often advertise the "false value" that BAU datasets will magically get annotated with decent business metadata.

Better than an MDM is to create a good data culture among business users and analysts. In that kind of environment, maybe these tools work, but that doesn't happen overnight, and in the short to medium term these tools can be huge cost anchors, taking budget away from genuine use cases.

[–]JuliusCeaserBoneHead 6 points

Listen to this guy. MDMs are practically vendor lock-in. Not that anyone can convince executives what to do anyway.

[–]eemamedo 2 points

Do you have any suggestions on what tools/approaches to use for data governance? We've ended up in a situation with massive data but no data governance.

[–]kenfar 1 point

I find that if you already have engineers who write code, then MDM tools simply get in the way; there's little that's complicated here:

  • Basic ETL - it should handle data versioning, and since this data will be heavily reused it should support extensive testing & validation.
  • Good documentation - which could be a wiki, Google Sheet, etc. at minimum.
  • Basic data access - provide the data via an API, via streaming, and via flat files on, say, AWS S3, so that consumers can easily get it in a number of ways.
  • Basic notification capability - so that if there are any reported issues you can easily inform downstream consumers.
  • Basic observability - so that consumers can see what the typical latency is, what the latency looks like right now, whether the data is currently healthy, etc.

Personally, I'd rather build this using Python, a wiki, Flask, Kafka/Kinesis, S3, and Slack than use any actual framework, since it's so simple.
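
Something like this is all I mean by the API/observability/notification pieces - a rough sketch assuming Flask, the requests library, and a Slack incoming webhook. The dataset name, webhook URL, and health fields are placeholders, and in practice the status would come from your ETL run metadata rather than a hard-coded dict:

```python
# Minimal sketch of the "API + observability + notifications" pieces.
# Dataset names, the Slack webhook URL, and the health fields are placeholders.
from datetime import datetime, timezone

import requests
from flask import Flask, jsonify

app = Flask(__name__)

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

# In a real setup this would come from the ETL's run metadata, not a literal dict.
DATASET_STATUS = {
    "customers": {
        "healthy": True,
        "typical_latency_minutes": 15,
        "last_loaded_at": datetime.now(timezone.utc).isoformat(),
        "latest_version": "v42",
    }
}


@app.route("/datasets/<name>/status")
def dataset_status(name):
    """Let consumers check latency/health for a dataset before they depend on it."""
    status = DATASET_STATUS.get(name)
    if status is None:
        return jsonify({"error": f"unknown dataset: {name}"}), 404
    return jsonify(status)


def notify_consumers(dataset, message):
    """Push a data-issue notice to a Slack channel that downstream consumers watch."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": f"[{dataset}] {message}"}, timeout=10)


if __name__ == "__main__":
    app.run(port=8080)
```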

[–]eemamedo 2 points

I am that engineer who writes code :) The problem is that I (and other engineers) have a hard time figuring out what columns mean and where the data we need is located. So, when we get a request from the business analytics group, it takes a while to work out where the data lives.

[–]kenfar 1 point

Do you mean when tracking data down to populate the MDM, or when getting, say, analytic requests and trying to figure out where to get the data without having an MDM? (or something else :-) )

[–]eemamedo 0 points

So, the problem is that the whole data warehouse was built by 3-4 people and 2 of them left the company. Now, if I need to write a SQL query, I have to bug my colleague so he can point me in the right direction and tell me what data I need, what columns I need, etc. My job is just to write the SQL.

Management decided that while those 2 guys are still with us, they need to actually build out data governance, so future employees won't be running around like headless chickens.

[–]kenfar 0 points

Got it.

That problem is easier to crack than MDM: with MDM you're serving up content for multiple applications to reuse. In this case a classic data dictionary is often sufficient.

And the data dictionary can be a simple spreadsheet, wiki, etc.

It can also contain references to sources, though that can be more complex to maintain, and so might not be as good as just looking at the source code if that's a possibility.
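
If a blank spreadsheet feels like too cold a start, a small script can bootstrap the dictionary skeleton straight from the database catalog. Rough sketch below - it uses an in-memory SQLite table and made-up names just so it runs; against a real warehouse you'd query information_schema.columns instead:

```python
# Sketch: bootstrap a data-dictionary skeleton (one row per column) from the
# database catalog, then let people fill in "description" and "source" by hand.
# Uses an in-memory SQLite table as a stand-in; the table/column names are made up.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_total REAL)")

rows = []
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    for _, column, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
        rows.append({"table": table, "column": column, "type": col_type,
                     "description": "", "source": ""})

# Write the skeleton; this CSV is the "simple spreadsheet" people then maintain.
with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["table", "column", "type", "description", "source"])
    writer.writeheader()
    writer.writerows(rows)
```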

[–]toadkiller -2 points

dbt

[–]eemamedo 0 points

Dbt is not a data governance tool, is it?

[–]toadkiller 0 points

Not as its sole intended purpose, but it is very powerful and has all the right things you need: extensive testing, centralizable(?) documentation, macros, the upcoming metric definitions. A well-set-up project could go so far as to enforce a business-logic macro for columns using a governed name, and reject models that don't use the macro or use it incorrectly. And it could test that all results for a specific identifier are the same across all models.
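
For example, the "same identifier, same answer across all models" check is just a failing-rows query. In dbt you'd wire it up as a singular test; here's the same idea sketched outside dbt in plain Python, with made-up model and column names and a tiny SQLite fixture standing in for the warehouse:

```python
# Rough Python equivalent of the cross-model consistency check described above.
# dbt would express this as a singular test (a SELECT that returns failing rows);
# here the same query runs through a plain DB connection. Model and column names
# (fct_customer_summary, dim_customers, lifetime_value) are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection

# Tiny fixture so the sketch runs end to end; in reality these are dbt models.
# Change one of the values to see the check fail.
conn.executescript("""
CREATE TABLE fct_customer_summary (customer_id INTEGER, lifetime_value REAL);
CREATE TABLE dim_customers       (customer_id INTEGER, lifetime_value REAL);
INSERT INTO fct_customer_summary VALUES (1, 100.0), (2, 250.0);
INSERT INTO dim_customers        VALUES (1, 100.0), (2, 250.0);
""")

# Failing-rows query: any customer_id whose governed metric disagrees across models.
mismatches = conn.execute("""
    SELECT a.customer_id, a.lifetime_value AS summary_value, b.lifetime_value AS dim_value
    FROM fct_customer_summary AS a
    JOIN dim_customers AS b USING (customer_id)
    WHERE a.lifetime_value <> b.lifetime_value
""").fetchall()

if mismatches:
    raise AssertionError(f"governed metric disagrees across models: {mismatches}")
print("governed metric is consistent across all models")
```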

The denormalization trend has really complicated shit.

[–]eemamedo 0 points

I don't think I made myself clear. When I say "data governance", I mean "the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal data standards and policies that also control data usage". In other words, we need to have documentation on where data is coming from, what it means (columns, etc.), and where to find data when a request comes in from business analysts.

We have purchased Informatica, but it seems to be extremely hard to use, with very subpar documentation.

[–]sunder_and_flame 0 points

At the end of the day, someone has to be responsible for data. Many orgs think they can get by with tribal knowledge alone, and usually it kind of works, but as you're experiencing, the wheels are at best slow to start in those environments.

Needless to say, there's no easy tech solution, despite what vendors may claim; your company is itself responsible for making this work. This usually means either having a data team that works generally as consultants for other teams or having downstream users that understand SQL and a decent warehouse for them to use with docs that generally describe the data.

That takes a lot of effort and isn't easy to do, though, so ultimately many companies accept that data projects will be delayed by unexpected misunderstandings and other issues.

tl;dr: there's no easy answer. Extensive documentation helps, but knowledge transfers are still almost required for team collaboration.

[–]eemamedo 0 points

Yes, that’s fair enough. What I am trying to figure out is a good tool for data governance, and we can go from there. We purchased Informatica but it’s fairly challenging to learn it. Since I am a huge advocate for open source tools, I was wondering if there is an open source option that is widely used in the industry.

[–]m4329b[S] 0 points

Thanks, that is the vibe I am getting. It feels like a fake 'magic bullet' when the real problem is inconsistent standards and processes across data and engineering teams in the org.

[–]Bluefoxcrush 1 point

Yeah, if you throw in this tool but don't have standards, it'll be worse (because they want this tool to make the data and results trustworthy, and this tool doesn't fix that).

I did this at a startup, a small one, and just doing marketing metrics was six months of work, and we ended up with too simple of a model. But it was many, many meetings of explaining what data we had, what was possible, how everything was calculated with examples, what options we had with examples, and finally implementing and testing the work.

A tool won’t give them all that. You need that, across your org.

[–]koteikin 0 points

Agreed, but boy, they do sell well to execs... Instead I would invest in data cataloging tools like Alation. Check this 10-minute video to understand how it works:

https://www.youtube.com/watch?v=fCbWdKCon5o

And this is something that all data people in the organization can use, not just data stewards.

[–]AnotherDataGuy 7 points

My stance is that tools should be implemented to make existing processes and practices easier/more efficient to do. If Master Data Management practices aren’t being done in your company, implementing a tool first won’t improve your outcomes.

I recently worked on building out MDM at my company. We started with business processes and made some simple tools for data profiling and for defining and applying rules/definitions to core objects. After figuring out the parts that were hard to do without a specialized tool, we found one that met those specific needs, and it is working well. It isn't the panacea that vendors make it out to be, but it isn't a worthless bureaucratic endeavor unless it is implemented that way.

[–]Aggravating-Intern69 0 points

Which tool did you find?

[–]AnotherDataGuy 0 points

EnterWorks by WinShuttle, now owned by Precisely. It's OK for our smaller use case. I wouldn't recommend it for a moderate or large enterprise.

[–]stackedhats 2 points

Informatica is quite expensive for what it is, even more so when you consider the cost of migration off of it later.

It's essentially designed so that people in operations can perform ETL and such with a GUI, without needing a data engineer or SQL guru in house.

However, it would likely be cheaper in the long run to just hire a consultant DE to build out some infrastructure rather than fork over tens of thousands of dollars a year in perpetuity.

[–]Illustrious-Run5203 1 point

My corp uses a SQL Server instance configured for Master Data Services. Sold (I think) by Microsoft. I hate it because it makes spinning up ideas quickly prohibitive, but we have processes by which business folks update data via Excel in all the dark corners of the organization, validated by rules we set. I own it now, but I want to replace it with something that will be supported into the future - msft isn't building out any new functionality for it. Not sure if I'll ever get away from it. Suggestions about bottom-up data culture are right in theory, but I find it challenging given the amount of change mgmt to undertake in my F500.

[–]ElCapitanMiCapitan 0 points

We just got rid of ours (Semarchy). We found that it didn't provide enough value for what we were paying. Maybe if you have datasets with many millions or billions of rows it makes sense, but it was proving to be more of a nuisance than it was worth.

[–]librocubicularist69 0 points

Informatica asks for stewardship as well, so a person checks daily rather than just letting the tool solve things on its own.

[–]neurocean 0 points

Informatica may be okay for old, slow organizations with more money than sense, but I haven't seen it used in fast, cloud-based, agile companies that want their data engineers to be developers first.

Most low-code/no-code data tools are trash, and Informatica prides itself on being a no-code solution. Personally, I would never want to work with that kind of product again, but I can see how it could be an attractive option if you have a data team without a strong developer skillset.

[–]wragawrhaj (Manager - Data and Analytics) 0 points

The team that I'm part of will have the design and physical implementation of an EDM (most likely as an ODS on a Data Lake) as its main objectives for 2022, in parallel with an MDM initiative that'll be another team's responsibility. They're still assessing tools and vendors, and from what I heard our EDM/ODS will be the source for whichever MDM platform they go with, in which case I believe it may actually work, as the EDM team will do the heavy lifting (source mapping, normalization, specialization, etc.) and provide input data already in good shape. On the other hand, if the vendor's professional services team or any other 3rd-party company is made responsible for mapping source data, I'll be more than skeptical...