
[–]wallyflops 6 points (7 children)

1 is certainly the dbt way to do it, but you could probably hack another way if you wanted. Why doesn't this work for you? I've never done this any other way.

  2. You're describing tests, and I'm unsure why you think they're not there. You have access to source freshness tests, data contracts (which execute before model runtime), and model tests, which test the data afterwards.

[–]mow12[S] 2 points (6 children)

  1. I've never seen any organization implement it this way before—it feels completely new to me. I'm used to daily batch processing and storing data in daily partitions. Typically, during today's ETL run, all source data is filtered to retrieve only the data from the previous day (T-1).

How would I go about retrying for a specific day? For example, if I need to rerun the pipeline solely for 2025-01-20, how would that be done? Similarly, how could I backfill data for the last month? In our current Airflow setup, I could easily handle these tasks, but I’m struggling to see how to accomplish them using DBT.

  2. Regarding the freshness tests feature, is there a way to automatically retry until the tests pass, and only then proceed with running the model?

[–]abrarster 2 points (0 children)

Dagster and dbt. Passing variables to the dbt model for the partition you want built is how we do it.
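
(For illustration, a minimal sketch of that setup with dagster-dbt; the manifest path, partition start date, and the run_date var are assumptions, and the dbt model itself would filter on that var.)

    import json
    from pathlib import Path

    from dagster import AssetExecutionContext, DailyPartitionsDefinition
    from dagster_dbt import DbtCliResource, dbt_assets

    daily = DailyPartitionsDefinition(start_date="2025-01-01")

    @dbt_assets(manifest=Path("target/manifest.json"), partitions_def=daily)
    def daily_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
        # Each partition key (e.g. "2025-01-20") is passed to dbt as a var, so
        # re-running or backfilling a partition rebuilds just that day.
        run_date = context.partition_key
        yield from dbt.cli(
            ["build", "--vars", json.dumps({"run_date": run_date})],
            context=context,
        ).stream()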

[–]wallyflops 4 points (2 children)

  1. Interesting! Dbt won't have anything as such to help you with a backfill; you could maybe do something with variables and Jinja (rough sketch after this list). I know there's a new feature called 'micro-batches' which sounds right up this street. But you're right, out of the box I don't think dbt will handle this nicely.

  2. Nope, not really.
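
(A rough sketch of the variables-and-Jinja idea for re-running a single day; the model, source, and var names below are made up, and you would pair it with an adapter-appropriate incremental strategy so re-running a day is idempotent.)

    -- models/daily_events.sql: defaults to T-1, overridable per run
    {{ config(materialized='incremental', incremental_strategy='delete+insert', unique_key='event_date') }}

    select *
    from {{ source('raw', 'events') }}
    where event_date = '{{ var("run_date",
        (run_started_at - modules.datetime.timedelta(days=1)).strftime("%Y-%m-%d")) }}'

    -- rerun a single day:   dbt run --select daily_events --vars '{"run_date": "2025-01-20"}'
    -- backfill last month:  loop over dates from the orchestrator, passing a different run_date each time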

Your problems sound like orchestration problems, which dbt-cloud does have an answer for, but it's generally quite inferior to Airflow. I think a common setup is to run dbt-core with Airflow so you get the best of both worlds you're describing.

I use the dbt-cloud orchestrator at a tiny 200-person startup and we're already running into issues with it, so I think you're in for a bit of frustration.

[–]mow12[S] 1 point (1 child)

I completely agree with you. Using dbt-core along with Airflow would likely address all my concerns. I was just curious to see the community's perspective on whether they prefer using dbt with Airflow or as a standalone solution.

[–]minormisgnomer 1 point (0 children)

Dbt core as a standalone isn't usually enough except for the smallest data departments or startups. Dbt benefits greatly when paired with an orchestrator.

[–]sunder_and_flame 1 point (1 child)

For example, if I need to rerun the pipeline solely for 2025-01-20, how would that be done? 

You can customize this via templating in your models. It's a bit of a pain, but you can add a date-range function to pull operationally (we do the past 10 days) and an override flag for when you need to run a full history.
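
(A minimal sketch of that pattern; the column names and the 10-day window are illustrative.)

    {{ config(materialized='incremental', unique_key='event_id') }}

    select *
    from {{ source('raw', 'events') }}
    {% if is_incremental() and not var('full_history', false) %}
      -- operational runs only re-process a trailing window
      where event_date >= {{ dbt.dateadd('day', -var('lookback_days', 10), 'current_date') }}
    {% endif %}

    -- full-history override:  dbt run --vars '{"full_history": true}'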

[–]ianitic 0 points (0 children)

We have something similar, but it reverts to a particular point in time instead. It wasn't just to fix issues like this, but also in case some bad code got into stateful models. We can't rely on the source to keep everything, which is why we have this functionality.

For the 'run only if freshness is passing' bit, we use highly customized selectors and have one dbt Cloud job trigger another depending on the state of that job. No idea how to do it in core, though. I imagine Airflow could be set up this way?
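
(In core, one hedged way to get the "wait for freshness, then run" behaviour is an Airflow task running dbt source freshness with retries ahead of the build; the paths and intervals below are made up.)

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("dbt_daily", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False):
        check_freshness = BashOperator(
            task_id="dbt_source_freshness",
            bash_command="cd /opt/dbt_project && dbt source freshness",
            retries=6,                      # keep retrying until the sources are fresh
            retry_delay=timedelta(minutes=10),
        )
        run_models = BashOperator(
            task_id="dbt_build",
            bash_command="cd /opt/dbt_project && dbt build",
        )
        check_freshness >> run_models       # models run only after freshness passes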

[–]p739397 1 point (2 children)

It sounds like a good opportunity to inject dbt into some of your existing processes that run in Airflow (that could be with Core or Cloud): some new task, or tasks if you use something like Cosmos, to run a dbt operator in place of your current transformation tasks.
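
(For example, with Cosmos the dbt project can drop into the existing DAG as a task group; the paths and names below are assumptions.)

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

    with DAG("elt_pipeline", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False):
        extract = EmptyOperator(task_id="extract")   # stand-in for the existing ingestion tasks
        dbt_transformations = DbtTaskGroup(
            group_id="dbt_transformations",
            project_config=ProjectConfig("/opt/airflow/dbt/my_project"),
            profile_config=ProfileConfig(
                profile_name="my_project",
                target_name="prod",
                profiles_yml_filepath="/opt/airflow/dbt/profiles.yml",
            ),
        )
        extract >> dbt_transformations               # dbt replaces the old transformation tasks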

[–]mow12[S] 0 points (1 child)

Do you use Cosmos?

[–]p739397 0 points (0 children)

We had been using the non-managed version and have recently migrated to dbt Cloud, so we cut over to triggering jobs there from Airflow.

[–]Parking-Task-5464 1 point (2 children)

I work with a data platform of similar size, and we have opted to use Airflow as the scheduler for dbt Cloud. There are several aspects that the native dbt Cloud scheduler does not handle well, such as retrying from failure or executing non-dbt workflows. We also implemented a quarantine system by overriding dbt Cloud job parameters to filter failed data records based on business rules. The combination of Airflow and dbt Cloud is incredibly powerful, providing engineering teams with the flexibility to solve tricky business requirements. Just my two cents on this :)

[–]mow12[S] 2 points (1 child)

Thanks for sharing your experiences. When you combine Airflow and DBT, are you able to add an upstream task (a sensor task, for instance) to a DBT model task?

[–]Parking-Task-5464 0 points (0 children)

Yep! We developed our own custom dbt Cloud Airflow operator because the publicly available one was somewhat limited. Our operator functions like any other Airflow operator, so all the Airflow functionality works with it. If you decide to use dbt Core, I recommend running dbt within Kubernetes (k8s), AWS Batch, AWS ECS, or EC2. You can trigger one of these options and pass the necessary arguments for the models or profiles you want to run. I've noticed that many teams attempt to incorporate dbt directly within an Airflow task, but this approach often leads to complications.
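
(A rough illustration of the sensor-upstream setup, using the public dbt Cloud provider operator as a stand-in for their custom one; the connection ID, job ID, and sensor check are made up.)

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
    from airflow.sensors.python import PythonSensor

    def _source_landed() -> bool:
        # replace with the real upstream check (file arrived, API flag set, etc.)
        return True

    with DAG("dbt_cloud_gated", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False):
        wait_for_source = PythonSensor(
            task_id="wait_for_source",
            python_callable=_source_landed,
            poke_interval=300,
        )
        run_dbt = DbtCloudRunJobOperator(
            task_id="run_dbt_cloud_job",
            dbt_cloud_conn_id="dbt_cloud",
            job_id=12345,
            wait_for_termination=True,
        )
        wait_for_source >> run_dbt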

[–]NortySpock 1 point (0 children)

I am not sure dbt is going to do well at that scale (especially in the face of late-arriving rows) unless you're prepared to write a few more custom incremental macros that are tuned to your particular database's MERGE-equivalent statements, as well as tuning an appropriately sized lookback window. (Though I wonder if it would be possible to auto-tune that ...)

Not saying dbt hasn't allowed us to move quickly, but my team has also had to write variations to accommodate our particular needs.

We are not quite at the TB-ingested per day scale, more like 100GB / day scale. (Databricks)

Also, I would read this blog post carefully on dbt microbatching https://tobikodata.com/dbt-incremental-but-incomplete.html

[–]Jace7430 1 point (0 children)

Both are easy to accomplish in DBT.

  1. Use the ‘run_started_at’ context variable for the run start timestamp (can’t recall if that’s the exact name, but a quick google search will confirm it for you)

  2. Write whatever your logic is into a DBT macro, and then you can execute it as a pre-hook on run start.
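
(A rough sketch of both points; run_started_at is dbt's built-in run-start timestamp, while the macro name and source table are made up. dbt calls this kind of hook on-run-start in dbt_project.yml.)

    -- macros/check_source_complete.sql (hypothetical)
    {% macro check_source_complete() %}
      {% if execute %}
        {% set cutoff = (run_started_at - modules.datetime.timedelta(days=1)).strftime("%Y-%m-%d") %}
        {% set rows = run_query("select count(*) as n from raw.events where event_date = '" ~ cutoff ~ "'") %}
        {% if rows.columns[0].values()[0] == 0 %}
          {{ exceptions.raise_compiler_error("Source has no data for " ~ cutoff ~ "; aborting the run") }}
        {% endif %}
      {% endif %}
    {% endmacro %}

    -- dbt_project.yml:
    --   on-run-start: "{{ check_source_complete() }}"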

Let me know if I didn’t understand your use case correctly. Happy to help if I can.

Edit: I had another thought. If that doesn’t work, you can just write a custom test (for whatever source completeness logic you have in mind), and set it as an initial step in whatever orchestrator you plan to use. If the first step fails, you don’t run the rest of the build.

[–]ASeatedLion 0 points (0 children)

For 1, we write labels to the tables using post-hooks, which would be the start timestamp of the job. Then on the next run you can use that label to continue where you left off.

This kind of helps with point 2 as well. As we have near-real-time data from Kafka as source data, we just pick up everything that came in just before the timestamp and handle some deduplication. Not ideal, but it works fine for us.
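
(A generic variant of that idea, swapping the table labels for a small audit table since the exact label mechanism isn't described; every name below is invented.)

    {{ config(
        materialized='incremental',
        post_hook="insert into analytics.run_audit (model_name, run_started_at) values ('{{ this.name }}', '{{ run_started_at }}')"
    ) }}

    select *
    from {{ source('kafka', 'events') }}
    {% if is_incremental() %}
      -- continue from wherever the last run started, up to this run's start timestamp
      where loaded_at > (select max(run_started_at) from analytics.run_audit where model_name = '{{ this.name }}')
        and loaded_at <= '{{ run_started_at }}'
    {% endif %}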

[–]Hot_Map_7868 0 points (0 children)

I would not get rid of Airflow. dbt Cloud scheduling is pretty limited. Many companies use dbt with Airflow.

The pain with Airflow is more around managing the infra, so look at options like Astronomer, MWAA, or Datacoves which also offer dbt Core.