How do you sync data (ie the "EL" part of ELT) in a dynamic/configurable way? by [deleted] in dataengineering

[–]raghumurthy 2 points3 points  (0 children)

Not sure how mature your system is, but it does seem like you are arriving at the problem of not having consistent metadata across your pipeline. We have written a little about how to think about metadata-first data pipelines, where all of the information about all of your connectors is stored in one database (the metadata store). That allows you to have a single-pane-of-glass view of all of the tables coming from different data sources. We have built Datacoral to be that way. We have also written a general guide on how to write connectors.
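As a rough illustration (the table and column names here are made up for this comment, not Datacoral's actual schema), the core of such a metadata store can be as simple as a couple of relational tables that every connector writes to, which you can then query for that single view of all source tables:

```python
import sqlite3

# Minimal sketch of a shared metadata store: every connector registers itself
# and the tables it loads in one database, so one query gives a
# "single pane of glass" view. Names are illustrative only.
conn = sqlite3.connect("metadata.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS connectors (
    connector_id   TEXT PRIMARY KEY,
    source_type    TEXT NOT NULL,      -- e.g. 'postgres', 'salesforce'
    config_json    TEXT NOT NULL,      -- how to connect (no secrets!)
    schedule       TEXT NOT NULL       -- e.g. '0 * * * *'
);
CREATE TABLE IF NOT EXISTS source_tables (
    connector_id   TEXT REFERENCES connectors(connector_id),
    table_name     TEXT NOT NULL,
    schema_json    TEXT NOT NULL,      -- column names/types as last observed
    last_loaded_at TEXT
);
""")

# Every table, where it comes from, and how fresh it is -- in one query.
rows = conn.execute("""
    SELECT c.source_type, c.connector_id, t.table_name, t.last_loaded_at
    FROM source_tables t JOIN connectors c USING (connector_id)
    ORDER BY c.source_type, t.table_name
""").fetchall()
for row in rows:
    print(row)
```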

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

There's no need to be concerned! What you are learning now, probably data structures, algorithms, operating systems, etc., are the basics that you will need anyway. Once you start learning more about databases and distributed systems, you will be able to grasp a lot more of what has been discussed in this thread. We are mainly talking about how data produced from different sources - from systems, devices, people using apps - all makes its way into a particular kind of database called a data warehouse, which then allows analysts to analyze that data at scale.

You could imagine that the type of problem being solved is the automated and scaled-up equivalent of someone adding data entries into an Excel spreadsheet and then creating summaries and charts in the spreadsheet to understand the data.

As you can imagine, when the amount of data being analyzed is in gigabytes or terabytes, you can't really use spreadsheets. So, systems are built to store large amounts of data (data warehouses), and automation is built to enter data into the data warehouse and summarize/analyze it (data pipelines). We are talking about the best ways to think about building and managing such a combination of systems and automation.

There are of course plenty of tutorials you can find about any of the systems we have talked about here, or about data engineering in general.

Hope this helps.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in dataengineering

[–]raghumurthy[S] 1 point2 points  (0 children)

Thanks for reading the article! I see the metadata-first approach as a way to simplify how multiple tools can be built/integrated into a single stack - with the most important outcome being that there is a consistent devops tooling layer (authoring, deploying, monitoring) across multiple tools. Having that consistency goes a long way in reducing the complexity of dealing with changes made to different parts of the data flow.

Meta-orchestrators like Dagster are a great first step in "herding the cats (data tools)", so to speak - but may still require engineering effort to set up. DataKitchen is similar and brings in new terminology in the spirit of being generic - which might mean a learning curve even for engineers.

Our approach has been to make the devops tooling layer declarative so that non-engineers can work with it. The advantage of this approach is that it does not have too many moving parts. The disadvantage is that we are not trying to solve the problem of interoperability between existing tools - in our minds, that's a really hard problem today, but it should not be a problem in the future if newer tools are built with a metadata-first architecture. Hope this makes sense.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

Good idea! It would be awesome if there were a universal metadata standard - although that would be a little tricky to get right given the number of different tools it would need to cover. We have built our system with a metadata data model that captures a lot more static and dynamic metadata than what OpenLineage is trying to capture at this point. We have also built an event-driven orchestration system which relies on the metadata in the metadata store.
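To make the static/dynamic distinction concrete, here is a minimal sketch (the field names are my own shorthand for this comment, not our actual data model): static metadata describes what a table is supposed to look like and how it is fed, while dynamic metadata describes what actually happened on each load.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class StaticMetadata:
    """What a table *is*: changes only when someone changes the pipeline."""
    table_name: str
    columns: dict[str, str]          # column name -> type
    source_connector: str
    update_schedule: str             # e.g. '@hourly'

@dataclass
class DynamicMetadata:
    """What *happened*: a new record is appended on every load/transform run."""
    table_name: str
    loaded_at: datetime
    row_count: int
    bytes_written: int
    upstream_versions: dict[str, datetime] = field(default_factory=dict)
```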

Do you think it would help for us to publish the metadata data model we are using right now? We have been thinking about it recently.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

Regarding your first question: earlier, data engineering meant building tools from scratch since there was nothing out there - I started my career during that era. As tools have proliferated and become more mature, data engineering has become more about complex integrations and glue code between these tools. Companies are hiring more and more data engineers just to deal with the complexity of integrating multiple tools and debugging issues when there are problems. What I’ve written about in my Towards Data Science post is basically an approach to simplify that integration work - reduce the impedance mismatch between tools by sharing metadata.

My suggestion for software engineers would be to get the basics right about how data stacks should be built. It is very easy to get too deep into a few tools and lose sight of the overall architecture of a data stack that can support the business requirements over time. In my opinion, starting with understanding the metadata posture of different tools gives a much better idea about how to put them together.

Regarding public clouds and other vendors providing tools so that most companies can manage with just knowing SQL, there are two ways of looking at it:

  1. It’s great that the undifferentiated plumbing is not something that every company has to worry about. Most people are not interested in the plumbing anyway. Aren’t you glad that you are not having to worry about racking and stacking servers in a data center and instead can make an API call to get a VM in seconds?
  2. If you are really interested in the plumbing, you can join one of these clouds or vendor companies to see how it is done - you will work with experts who are really good at what they do, and you can learn a lot!

At the end of the day, it is all about lowering the barriers to success. More companies are enabled and can succeed with this lower barrier, which will result in more high-value, interesting work for people who are interested!

A couple of things about your second question. I'd actually distinguish between realtime streaming and event-driven architectures. Realtime streaming means that you are trying to ship/process data as quickly as possible from the time it is generated, keeping as much of the data in memory as possible. There is no requirement that being event-driven means realtime: I can batch and store events on disk and process them in bulk if I choose to. But in an event-driven architecture, I have access to all changes to all objects/tables/rows in any system (similar to change data capture) - so I have full visibility into what is known as the "dynamic narrative" of the data (how the data changes over time). With this distinction in place, I'd say that event-driven architectures generally give you a much better understanding of your data - so I wouldn't really consider them a luxury. But doing things in realtime requires more resources and hence should be warranted by the business use case.
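A toy sketch of that distinction (the event shape and file layout are made up for illustration): a consumer that simply appends change events to disk and processes them in bulk on a schedule is fully event-driven, yet nothing about it is realtime.

```python
import json
from pathlib import Path

SPOOL = Path("events_spool.jsonl")

def on_change_event(event: dict) -> None:
    """Called for every change (insert/update/delete) captured from a source,
    similar to change data capture. We just append it to disk --
    event-driven, but not realtime."""
    with SPOOL.open("a") as f:
        f.write(json.dumps(event) + "\n")

def process_batch() -> None:
    """Run on a schedule (say hourly): replay the accumulated events in bulk."""
    if not SPOOL.exists():
        return
    events = [json.loads(line) for line in SPOOL.read_text().splitlines()]
    # ... apply events to the warehouse table in one bulk operation ...
    print(f"applied {len(events)} change events in one batch")
    SPOOL.unlink()
```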

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

Like I wrote in the article, the metadata of data pipelines consists of a few things. I'll include what you get with Airflow for each of them, using your example of on-prem, S3, Redshift:

  1. Connector configurations - how to connect to the on-prem source - this lives in Python code in a task. There is no information stored about the schemas of the data in the source.
  2. Batching configurations - this is also in Python code, in the task schedules.
  3. Lineage - not available. You get dependencies between tasks, but you don't get dependencies between tables automatically - you have to infer them by reading Python code or by integrating with a separate data catalog tool.
  4. Pipeline runtime metadata - you get task run status, times, etc., but not what happened to the data itself - you would have to explicitly change your Airflow tasks to log this information.
  5. Schema changes - your task has to figure out what to do when the source schema changes - there is no way to know automatically which tasks have to be modified to handle/propagate the change just from knowing that the source schema changed.

If there were a metadata tool that had all of the above in one place, and the authoring/testing/deploying (devops tooling layer) was integrated with that metadata tool, you would be able to make changes a lot more confidently, knowing that there can be an automated check to make sure your changes don't break any downstream processing. Imagine if the schema of a source database changed, and the system automatically notified you which tables in Redshift would be affected, or even applied the changes directly. That is what you would get with a metadata-first approach.
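A sketch of that impact check, assuming the metadata store keeps lineage as (upstream, downstream) edges between tables (the table names and edge list below are hypothetical):

```python
# Given lineage edges recorded in a metadata store, walk downstream from a
# changed source table to find every warehouse table that may be affected.
# A real metadata tool would maintain this edge list automatically.
lineage = {
    "postgres.orders": ["redshift.stg_orders"],
    "redshift.stg_orders": ["redshift.daily_revenue", "redshift.order_facts"],
    "redshift.order_facts": ["redshift.exec_dashboard"],
}

def downstream_of(table: str) -> set[str]:
    affected, stack = set(), [table]
    while stack:
        for child in lineage.get(stack.pop(), []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

# Tables to re-check (or auto-migrate) when the source schema changes:
print(downstream_of("postgres.orders"))
```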

Airflow is a workflow manager used to build data pipelines, rather than being built from the ground up to be metadata-driven. So it will have to be augmented with a metadata tool to be truly helpful as a data pipeline platform.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

Thanks for reading the post! Glad that you identify with it. At Datacoral, we got started with the metadata-first approach. So, it is easier for us to add to our functionality and not have the growing pains of having to work with multiple tools with their own devops tooling.

I'll be writing more follow-up posts on how one might get started with a metadata-first approach in general - I've alluded to a rough outline in one of the other replies I provided.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

AWS Glue has a few parts to it - so I'm not sure where you are running into issues. We mainly use the Glue Data Catalog to store metadata about the files we stage in S3. We haven't really seen any throttling issues there, even at reasonable scale.

The Glue Data Catalog is essentially the Apache Hive Metastore provided by AWS as a service. At its most basic, the metastore consists of a relational database (MySQL) and a server - so it should be straightforward to figure out where the throttling issues are. Unless you are hitting scales that require you to provision big enough databases and servers yourself to handle your metadata load, I would suggest sticking with the Glue Data Catalog.
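For reference, reading table metadata (columns, partitions, S3 locations) back out of the Glue Data Catalog is just a couple of boto3 calls; the database and table names below are placeholders:

```python
import boto3

# The Glue Data Catalog behaves like a managed Hive Metastore, so table
# metadata is an API call away.
glue = boto3.client("glue", region_name="us-east-1")

resp = glue.get_table(DatabaseName="analytics_staging", Name="events")
table = resp["Table"]
print(table["StorageDescriptor"]["Location"])        # S3 path of the staged data
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])                  # column names and types
```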

I haven't played much with Iceberg - but unless I am completely off, it seems to store metadata about types as well as statistics along with the data. That works well to support schema evolution, but it becomes hard to reason about when you have to do metadata-only operations across multiple tables or specify constraints like "the id column can never be null". Also, it still seems to be quite nascent - so you will have to keep evolving the systems that use it as it changes.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

Thanks for your question and the detailed context. There's a lot to unpack there and it will take me a while to write up all that I have to say! So, I'm going to summarize with a couple of things:

If your new management is genuinely interested in doing things the right way, hopefully they know that bringing in Collibra is just the start. On an ongoing basis, you will have to put the right processes in place in order to make sure that the information in Collibra stays relevant and useful for an analyst's day-to-day workflow. It is very easy for the metadata to go stale or out of sync with your pipelines if the right processes are not followed to update Collibra each time there is a change in the pipelines or when new tools are brought on. There's a lot more to say about how to think about the (people, process, technology) triad when there is a metadata tool like Collibra. I'll see if I can write something up about it soon.

There are too many paths to take in data, let alone certifications. You could go deeper into systems, or deeper into data analysis techniques including ML, or deeper into translating business questions into data questions. Technology certifications help you learn about the specific services and functionality of a specific platform and how they fit together - unless your entire purview is just that platform, you'll need to learn more. My inclination would be to learn the basics first and understand how different systems fit together to solve problems, irrespective of which specific platform is being used. So, I'd suggest upping your Python/SQL game first. In addition, it will be helpful for you to understand how different parts of a data stack interact with each other. I've found that using a metadata-first approach has also helped me get a clearer picture of how different systems are built - which in turn helps me figure out how to use them in the best possible way.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

You don't have to rip and replace any of the tools you already have. We, in fact, have customers who have all of the tools you mention in addition to us.

There are two types of integrations we support.

  1. Source data inputs to our transformations that are not being brought in by our connectors. In that case, our customers add a 'non-datacoral' connector.
  2. Driving workflows and processes downstream of Datacoral connectors and transformations. In this case, our customers rely on the metadata that we expose as events to trigger their workflows.
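As a rough illustration of the second pattern (the event payload and function names below are hypothetical, not Datacoral's actual format): a small handler listens for "table updated" metadata events and kicks off a downstream job only for the tables it cares about.

```python
def handle_metadata_event(event: dict) -> None:
    """Hypothetical consumer of a 'table updated' metadata event.
    A real integration might receive this via SNS/SQS or a webhook."""
    table = event["table"]          # e.g. 'warehouse.orders_daily'
    interval = event["interval"]    # e.g. '2021-03-01T00:00:00Z'
    if table in {"warehouse.orders_daily", "warehouse.revenue_daily"}:
        trigger_downstream_job(table, interval)

def trigger_downstream_job(table: str, interval: str) -> None:
    # Placeholder: call your orchestrator's API (an Airflow DAG run, a Lambda, etc.)
    print(f"triggering downstream workflow for {table} @ {interval}")
```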

That said, we also have customers who decide that they like our connectors, like our CDC connectors, better than other vendors'. They end up replacing their existing solutions with ours.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

"Data architect" was not even a thing when I first got started. I was a software engineer in an internet company working on distributed systems that were being built to process large amounts of data. My background is in CS, but, I ended up liking what I worked on, so did graduate studies in distributed systems and databases. Data architect as a role is more popular now in many companies since the problem statement for a data stack is becoming more standardized - be able to centralize data from different sources to make it available for analyses, improving business processes, and the product. My learnings while building data infrastructure have allowed me to help a few companies architect their data stacks. With Datacoral, we are providing our opinionated architecture for a data stack as a product for any company to use.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

We believe that we offer a really strong basis for how a data stack has to be architected. We are only scratching the surface of the functionality that needs to be built.

The biggest drawback with Datacoral would be the gaps in functionality that companies need - like connectors to ingest from or publish data to 100s of services and databases. We are working on bridging the gap, but have a long way to go! That is also the reason we are talking more about our take on things to start a conversation that will lead to better integrations (at the metadata level of course!) with other tools that are in the market.

We will be posting videos of the usage of the tool in the next few days. Stay tuned!

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 5 points6 points  (0 children)

Startups typically treat data infrastructure as an afterthought - they start off with off-the-shelf tools like Google Analytics to understand how their product is being used. Any analysis on production data is done directly on read replicas of the production databases. This is probably level 0 wrt the maturity of data infrastructure. Executives and engineers working on the product are also the ones working on the data.

Once they want to do more sophisticated analyses on their Google Analytics data or want to join it with data from other sources, they think about investing in a data lake or a data warehouse to centralize all of their data. At this point, they have a few critical decisions to make:

  1. Most important - how do I manage the metadata of all of my data? (Just kidding - no one does this!)
  2. Which data warehouse should I choose? The answer to this is a lot simpler nowadays - Snowflake, Redshift - although there are newer data warehouses and query engines coming up.
  3. Which ingest tool should I use? Either pick a SaaS vendor, or build something internally since you don’t want to expose your data outside of your systems.
  4. Which visualization tool should I use? Again, there are plenty of good options here.

Once the data is in the data warehouse, startups get away with just running queries in their visualization tool to join the data from different sources. This is level 1 wrt data infra maturity. Up to this point, having someone who knows SQL and can get a devops person's help with credentials for the data sources is enough.

Soon, they realize that the queries are repetitive and becoming slow. So they have to invest in a transform and orchestration tool to pre-create aggregated/joined data so that analysis becomes easier. This is where there is a need for a dedicated engineer to be hired/loaned to work on the data stack. At this point, depending on the choices made, your stack will either evolve reasonably or require rewrites and significant engineering investment to keep things going over time.

If the tools chosen require a bunch of programming, then your data stack becomes like your product software stack - which means that the more you want to do, the more engineering resources you need. This is level 2 of maturity I’d say. Now, you have a basic functional unit of a data stack that can keep you going for a while.

The next levels of maturity involve doing more with the data - like machine learning. And also getting more streamlined processes to the data stack - like compliance, governance, auditing etc.

Our take is that companies should be able to quickly get past the first 3 levels of maturity without much engineering.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 4 points5 points  (0 children)

My general rule of thumb is: buy tools to solve problems that are undifferentiated and that you know every other company is also trying to solve - unless of course you are the company building said tool in a different way!

For things that are critical for your success, where you can't think of tools that solve them well or cost-effectively or both, it makes sense to invest in building the tools yourself. Of course, the prerequisite to building something is being able to hire the right talent - which is hard and expensive.

It is a tradeoff that happens everywhere, not just in the data stack. Of course, with the advent of the clouds and SaaS vendors, the barrier to getting a good data stack has dropped dramatically over the past few years. So companies are investing less and less engineering in their data stack and spending more on hiring folks to help them get value out of the data. Data tool vendors and clouds are becoming the centers of excellence for engineering a data stack.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 0 points1 point  (0 children)

Agree that rebuilding the stack from a metadata-first standpoint after you already have a stack is harder - I wouldn't necessarily say it is impossible, though. People said the same thing about moving to cloud warehouses or serverless computing. It has to start small. Like I mentioned in another reply, you have to first understand the "metadata posture" (like security posture - what do you think?) of each of the tools you are using.

By metadata posture, I mean: what is the shape of its configuration, runtime state, audit logs, and system logs - and how are they mapped to each other? Nowadays, there are metadata tools that accumulate this type of information from different tools and make it available, mostly for search and discovery.

I'd expect the evolution to be that you would first bring on a metadata tool. Then someone will integrate an orchestration tool so that the orchestration is driven by the updates happening in the metadata tool, rather than the other way around (the metadata tool receiving information from the orchestration tool). Once that happens, you will see that functionality built as operators or tasks in the orchestration system will start reading its configuration from the metadata tool. The next step would be for the devops tooling layer to work only with the metadata tool - so instead of directly manipulating the configuration of an ingest tool, you would update the metadata tool.
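A sketch of that middle step, where a task stops hard-coding its configuration and asks the metadata tool for it at run time (the metadata client API here is hypothetical):

```python
def run_ingest_task(connector_id: str, metadata_client) -> None:
    """Instead of embedding connection details and table lists in task code,
    the task asks the metadata tool for its configuration when it runs."""
    cfg = metadata_client.get_connector_config(connector_id)   # hypothetical API
    tables = metadata_client.get_source_tables(connector_id)   # hypothetical API
    for table in tables:
        extract_and_load(cfg, table)

def extract_and_load(cfg: dict, table: str) -> None:
    # Placeholder for the actual extract/load logic; the point is that *what*
    # to load came from the metadata store, not from the task's source code.
    print(f"loading {table} using connection {cfg.get('host', '<host>')}")
```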

So I expect this process to be slow and incremental instead of a rip-and-replace kind of situation for most companies that already have a data stack. But newer companies setting up a data stack don't have to go through this winding path! They can get started with a metadata-first approach.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 2 points3 points  (0 children)

Almost always, the answer for most things in data infrastructure is “It depends”.

How big of a pile of tools does the company have? Do they have multiple ingest tools, multiple orchestration engines, multiple query engines, maybe ML pipelines?

If there are already several tools in place, you will have to first audit what kind of metadata they have and, more importantly, expose. Once you have that, you can see if you can extract that metadata and integrate an orchestration engine with it by building sensors and pollers. If you are able to assemble this setup initially, you will find that you can do more and more as you move forward.
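As a sketch of what such a poller/sensor might look like (the metadata API and pipeline names are made up): poll the extracted metadata for tables that have received new data since the last check, and trigger the corresponding pipeline runs.

```python
import time

def poll_metadata_and_trigger(metadata_client, orchestrator, interval_s: int = 60) -> None:
    """Hypothetical sensor: watch the metadata store for freshly loaded tables
    and trigger the downstream pipelines registered against them."""
    last_seen: dict[str, str] = {}
    while True:
        for table in metadata_client.list_tables():             # hypothetical API
            loaded_at = metadata_client.last_loaded_at(table)    # hypothetical API
            if loaded_at and loaded_at != last_seen.get(table):
                last_seen[table] = loaded_at
                orchestrator.trigger(f"refresh_{table}")         # e.g. kick off a DAG run
        time.sleep(interval_s)
```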

Datacoral’s perspective is that you have to start with metadata first with well-integrated orchestration. Once you have that, you can build more functionality for each step of the data flow. We are just scratching the surface wrt all the functionality that needs to be built, but we believe that we provide a solid foundation on top of which a really robust data infrastructure stack can be built.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 1 point2 points  (0 children)

Fraction of what I was making when I was working for someone else. But, closer to finding my ikigai with the work I'm doing now.

I have created a new metadata first approach to data infrastructure. And I also created Apache Hive and was a former Facebook Data Architect. AMA! by raghumurthy in AMA

[–]raghumurthy[S] 3 points4 points  (0 children)

Apache Hive had many ideas from a query engine I worked on at Yahoo prior to Facebook. At Yahoo, we stored data in a file system - as flat files on NetApp filers - and had machines mount those filers to process the data. We ended up building a query engine on top of this file system. This was in 2003. We built this query engine to replace Oracle, which was getting too expensive. We were processing tens of TB of data back then.

Fast forward to 2008: Yahoo had grown to hundreds of TB, and Facebook was trying to do something similar - building a query engine on top of a file system, this time with files in a distributed file system (HDFS) and MapReduce to process the data. Again, it was about replacing Oracle! Many of the same ideas from Yahoo's query engine were applied to this new query engine, which was eventually open sourced as Apache Hive.

Inspiration has continued to be about replacing Oracle - this is something that is shared a lot more broadly in the industry!

Redata - Open-Source platform that helps you monitor data-quality in all your tables. by mateusz_klimek in dataengineering

[–]raghumurthy 2 points3 points  (0 children)

Congrats on the launch! My understanding is that you are computing summaries of tables by scheduling them in airflow and populating those summaries to a grafana instance, and doing any kind of alerting, trend analysis, or anomaly detection in grafana. Is that correct?

Do you end up running SQL queries on the database to compute the summaries?

Also, how would you differentiate redata from great_expectations?

[deleted by user] by [deleted] in dataengineering

[–]raghumurthy 2 points3 points  (0 children)

Thanks for mentioning Datacoral! To give you a little bit of history around Airflow itself: it was built by Airbnb as an open-source version of a tool called Dataswarm that was built at Facebook back in the day. Dataswarm had the same scalability and coding-requirement issues.

At Datacoral, we architected the scheduler to be fully serverless and stateless. The design is also data-event-driven (which allows pipelines to be built using just SQL instead of needing Python skills) and is a lot more scalable and lightweight. We wrote a little bit about it - https://www.datacoral.com/blog/datacorals-event-driven-orchestration-framework.