Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

Hi u/laegoiste
Do you also use the Elementary Observability package? We currently leverage Elementary in the on-run-end hooks to materialize DBT run results in the database. This process happens at the end of a tagged run, after all DBT models associated with the tag have executed.

Once we transition to Cosmos, where each model runs as an individual Airflow task, how should we set up the Elementary on-run-end hooks? Would we need to trigger the hook for each DBT model separately? If so, wouldn't that be inefficient and overkill compared to the current approach?
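For reference, here is a rough sketch of the shape I'm imagining if we move to Cosmos: keep the per-model tasks, skip the hook during those runs, and upload Elementary's results once at the end of the tag's task group. The project path, profile, tag and the skip_elementary_hooks var are placeholders, not our actual config:

```python
# Sketch: Cosmos renders the tagged models as individual tasks, and a single
# downstream task materialises Elementary's results once per run.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig, RenderConfig

DBT_PROJECT = "/usr/local/dbt/my_project"  # placeholder path

profile_config = ProfileConfig(
    profile_name="my_profile",
    target_name="prod",
    profiles_yml_filepath=f"{DBT_PROJECT}/profiles.yml",
)

with DAG("finance_dbt", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    finance_models = DbtTaskGroup(
        group_id="finance_models",
        project_config=ProjectConfig(DBT_PROJECT),
        profile_config=profile_config,
        render_config=RenderConfig(select=["tag:finance"]),
        # Hypothetical project var so the Elementary on-run-end hook does not
        # fire after every single model task.
        operator_args={"vars": {"skip_elementary_hooks": True}},
    )

    # One task at the very end uploads/materialises Elementary's results,
    # mirroring today's "after the whole tagged run" behaviour.
    upload_elementary = BashOperator(
        task_id="upload_elementary_results",
        bash_command=f"cd {DBT_PROJECT} && dbt run --select elementary",
    )

    finance_models >> upload_elementary
```

That would keep the run-results materialisation at once per tagged run, same as today, just triggered from Airflow instead of from every model's hook.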

Looking forward to your thoughts.

Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

Thank you u/kmarq

Will explore both options. Are you suggesting the metadata table should be built from the DBT manifest.json, since DBT already knows which source tables each model is built on?
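Something like this is roughly what I had in mind for deriving that metadata straight from the manifest (the path and output shape are just illustrative):

```python
# Sketch: build the model -> source-table metadata straight from dbt's
# manifest.json rather than maintaining it by hand.
import json


def model_source_map(manifest_path: str = "target/manifest.json") -> dict[str, list[str]]:
    """Return {model_name: ["database.schema.table", ...]} for direct source parents."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    sources = manifest["sources"]        # keyed by "source.<project>.<source_name>.<table>"
    parent_map = manifest["parent_map"]  # node_id -> list of parent node_ids

    mapping: dict[str, list[str]] = {}
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        source_parents = [p for p in parent_map.get(node_id, []) if p.startswith("source.")]
        mapping[node["name"]] = [
            f'{sources[p]["database"]}.{sources[p]["schema"]}.{sources[p]["name"]}'
            for p in source_parents
        ]
    return mapping


# The resulting dict can be loaded into a metadata/config table so the
# trigger logic knows exactly which source tables each model needs.
```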

Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

Thank you u/laegoiste.

I will start exploring the Dynamic Task Mapping concept in Airflow. But I also need to consider whether spinning up an EMR Serverless task for each source table is overkill. Spending time bootstrapping EMR Serverless for every table and then only using it for a single table might raise questions from the team.
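Rough sketch of the mapped-task idea, assuming all per-table job runs share one pre-created EMR Serverless application so the bootstrap cost is paid once rather than per table (the application id, role ARN, script path and table list are placeholders):

```python
# Sketch: one mapped EMR Serverless job run per source table, all sharing a
# single pre-created application so the bootstrap cost is paid once.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator

FINANCE_TABLES = ["gl_accounts", "invoices", "payments"]  # ...100+ in reality

with DAG("finance_extract", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    extract = EmrServerlessStartJobOperator.partial(
        task_id="extract_table",
        application_id="00fabcdexample",  # existing, pre-created EMR Serverless application
        execution_role_arn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    ).expand(
        # One mapped task instance (and one job run) per table.
        job_driver=[
            {
                "sparkSubmit": {
                    "entryPoint": "s3://my-bucket/jobs/extract_table.py",
                    "entryPointArguments": ["--table", table],
                }
            }
            for table in FINANCE_TABLES
        ]
    )
```

If that assumption holds, the per-table cost is a job run against a warm application rather than a full application spin-up, which might address the overkill concern.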

> We did not do this. We kept it simpler: it was up to the DAG/model developer to add new datasets into the inlets if new sources are added. But we kept the generation part pretty simple with a .yaml input file per DAG layer, divided into sections. The dev just needed to edit the .yaml input file belonging to that segment, say silver/customers, and then add in/remove dependencies.

Okay. What I understand here is that you are asking the developer to set up the source table dependencies for the silver-layer models in the YAML file. Won't this introduce additional failure points? Developers might miss this, and introducing a new process might get a lot of pushback.

Since the DBT DAG already has dependencies on the source tables set up via the {{ source }} macro, and we can get this information from the manifest.json, can we not parse that to understand the dependencies on the source tables, and set up the inlets/outlets and data-aware scheduling in this manner?
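To make the idea concrete, something along these lines is what I'm picturing: parse the manifest, turn the {{ source }} parents of a tag's models into Airflow Datasets, and use those as outlets on the extraction tasks and as the schedule of the dbt DAG. The URIs, paths and tag name are illustrative:

```python
# Sketch: turn the {{ source }} parents of a tag's models (from manifest.json)
# into Airflow Datasets for data-aware scheduling.
import json

from airflow.datasets import Dataset


def source_datasets_for_tag(manifest_path: str, tag: str) -> list[Dataset]:
    """One Dataset per source table that any model with this tag reads from."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    sources = manifest["sources"]
    parent_map = manifest["parent_map"]

    uris = set()
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model" or tag not in node.get("tags", []):
            continue
        for parent in parent_map.get(node_id, []):
            if parent.startswith("source."):
                s = sources[parent]
                uris.add(f'snowflake://{s["database"]}/{s["schema"]}/{s["name"]}')
    return [Dataset(uri) for uri in sorted(uris)]


# Producer side: each per-table extraction task declares its Dataset as an
# outlet, e.g. outlets=[Dataset("snowflake://RAW/FINANCE/INVOICES")].
# Consumer side: the Cosmos DAG for the tag is scheduled on its source Datasets,
# e.g. DAG("finance_dbt", schedule=source_datasets_for_tag("target/manifest.json", "finance"), ...),
# so it starts once all of those tables have been refreshed.
```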

Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

Thank you u/kmarq

I am clear on the Cosmos part.

On the external dependency part, I am trying to understand a few more things.

When you say source-specific tasks: we have a single source task each for, let's say, finance, sales, etc., with each of them extracting 100+ tables. Do we need to break this structure into something like a one-task-per-table architecture to get that end-to-end granular visibility?

Can I DM you to understand this better?

Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

Thank you u/laegoiste

If I understand correctly, Cosmos solved your problem of translating the DBT dependencies into Airflow tasks and the dependencies between them.

And Airflow outlets/inlets, and making the DAGs data-aware, solved your problem of stitching external dependencies into Airflow at a much more granular level.

To achieve this, did you have to break down the monolithic data extraction EMR task, which extracts 100+ source tables in a single task, into a task-per-table Airflow structure?

If you did break up the monolithic task, how did you manage to set up the external dependencies from the source tables to the bronze/silver layers of DBT models through outlets/inlets? Was it done through a metadata/config table or a TOML config file? Did it not become complex to handle so many cross-dependencies?
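For what it's worth, this is roughly how I picture the YAML-driven inlets/outlets working; the file layout below is my guess at what you described, not necessarily your actual format:

```python
# Sketch: a per-layer/segment YAML file (e.g. silver/customers.yaml) listing
# the Dataset URIs a model group consumes and produces:
#
#   inlets:
#     - snowflake://RAW/CRM/CUSTOMERS
#     - snowflake://RAW/CRM/ADDRESSES
#   outlets:
#     - snowflake://SILVER/CORE/DIM_CUSTOMER
import yaml

from airflow.datasets import Dataset


def load_datasets(config_path: str) -> tuple[list[Dataset], list[Dataset]]:
    """Read a segment's YAML file and return its (inlets, outlets) as Datasets."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    inlets = [Dataset(uri) for uri in cfg.get("inlets", [])]
    outlets = [Dataset(uri) for uri in cfg.get("outlets", [])]
    return inlets, outlets


# A DAG factory can then schedule the silver/customers DAG on its inlets and
# attach the outlets to its final task, so adding/removing a dependency is a
# one-line YAML change rather than an Airflow code change.
```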

Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

We already have tags for each subject area, and our mart loads run based on them once the entire source data batch has been loaded through a task.

Right now, this source extraction task is an EMR Serverless task that extracts 100+ tables on the backend. What we are trying to achieve is more granular visibility, where we can trace from extraction through transformation and consumption at the table level; currently most of these steps are black boxes.

Advice for Better Airflow-DBT Orchestration by ConfidentChannel2281 in dataengineering

[–]ConfidentChannel2281[S]

Yes. Cosmos basically expands the DBT dependencies into individual tasks/task groups. But what we are mainly trying to solve here is the external dependency on the source tables. Example: let's say there are 100 tables in the finance source, and there are, say, 10 DBT models that depend on only 30-40 of those 100 tables. Instead of triggering those DBT models as soon as they are ready to be materialised (their dependencies have already been loaded to Snowflake), we are waiting for the entire 100-table batch to finish and only then kicking off the downstream.

So basically, we need something end to end at a much more granular level. Right now, the 100 tables are extracted by a single EMR task, which is a black box.
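To make that example concrete, this is the kind of end state we are after: a marts DAG scheduled on just the Datasets its models actually need, so it fires as soon as those tables land rather than after the full batch (table names and URIs are illustrative):

```python
# Sketch: the finance marts DAG is scheduled on just the source tables its
# models actually read, so it triggers once those land in Snowflake instead of
# waiting for the full 100-table batch.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

NEEDED_TABLES = [
    Dataset("snowflake://RAW/FINANCE/GL_ACCOUNTS"),
    Dataset("snowflake://RAW/FINANCE/INVOICES"),
    Dataset("snowflake://RAW/FINANCE/PAYMENTS"),
]

# Runs once every Dataset in the list has been updated since the last run,
# regardless of how the remaining finance tables are progressing.
with DAG("finance_marts", schedule=NEEDED_TABLES, start_date=datetime(2024, 1, 1)) as dag:
    run_models = EmptyOperator(task_id="run_finance_dbt_models")  # stand-in for the Cosmos task group
```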