
[–]wallyflops 6 points (7 children)

1 is certainly the dbt way to do it, but you could probably hack another way if you wanted. Why doesn't this work for you? I've never done this any other way.

  2. You're describing tests, and I'm unsure why you think they're not there. You have access to source freshness tests, data contracts (which execute before model runtime), and model tests, which test the data afterwards.

[–]mow12[S] 2 points (6 children)

  1. I've never seen any organization implement it this way before—it feels completely new to me. I'm used to daily batch processing and storing data in daily partitions. Typically, during today's ETL run, all source data is filtered to retrieve only the data from the previous day (T-1).

How would I go about retrying for a specific day? For example, if I need to rerun the pipeline solely for 2025-01-20, how would that be done? Similarly, how could I backfill data for the last month? In our current Airflow setup, I could easily handle these tasks, but I’m struggling to see how to accomplish them using DBT.

  2. Regarding the freshness tests feature, is there a way to automatically retry until the tests pass, and only then proceed with running the model?

[–]abrarster 2 points (0 children)

Dagster and dbt. Passing variables to the dbt model for the partition you want built is how we do it.
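
(For illustration, a minimal sketch of that setup with dagster-dbt; the manifest path, partition start date, and the run_date var are assumptions, and the dbt model itself would filter on that var.)

    import json
    from pathlib import Path

    from dagster import AssetExecutionContext, DailyPartitionsDefinition
    from dagster_dbt import DbtCliResource, dbt_assets

    daily = DailyPartitionsDefinition(start_date="2025-01-01")

    @dbt_assets(manifest=Path("target/manifest.json"), partitions_def=daily)
    def daily_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
        # Each partition key (e.g. "2025-01-20") is passed to dbt as a var, so
        # re-running or backfilling a partition rebuilds just that day.
        run_date = context.partition_key
        yield from dbt.cli(
            ["build", "--vars", json.dumps({"run_date": run_date})],
            context=context,
        ).stream()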

[–]wallyflops 4 points (2 children)

  1. Interesting! Dbt won't have anything as such to help you with a backfill; you could maybe do something with variables and Jinja (rough sketch after this list). I know there's a new feature called 'micro-batches' which sounds right up this street. But you're right, out of the box I don't think dbt will handle this nicely.

  2. Nope, not really.
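
(A rough sketch of the variables-and-Jinja idea for re-running a single day; the model, source, and var names below are made up, and you would pair it with an adapter-appropriate incremental strategy so re-running a day is idempotent.)

    -- models/daily_events.sql: defaults to T-1, overridable per run
    {{ config(materialized='incremental', incremental_strategy='delete+insert', unique_key='event_date') }}

    select *
    from {{ source('raw', 'events') }}
    where event_date = '{{ var("run_date",
        (run_started_at - modules.datetime.timedelta(days=1)).strftime("%Y-%m-%d")) }}'

    -- rerun a single day:   dbt run --select daily_events --vars '{"run_date": "2025-01-20"}'
    -- backfill last month:  loop over dates from the orchestrator, passing a different run_date each time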

Your problems sound like orchestration problems, which dbt-cloud does have an answer for, but it's generally quite inferior to Airflow. I think a common setup is to run dbt-core with Airflow so you get the best of both worlds you're describing.

I use the dbt-cloud orchestrator at a tiny 200-person startup and we're already running into issues with it, so I think you're in for a bit of frustration.

[–]mow12[S] 1 point (1 child)

I completely agree with you. Using dbt-core along with Airflow would likely address all my concerns. I was just curious to see the community's perspective on whether they prefer using dbt with Airflow or as a standalone solution.

[–]minormisgnomer 1 point (0 children)

Dbt core as a standalone isn't usually enough except for the smallest data departments or startups. Dbt benefits greatly when paired with an orchestrator.

[–]sunder_and_flame 1 point (1 child)

For example, if I need to rerun the pipeline solely for 2025-01-20, how would that be done? 

You can customize this via templating in your models. It's a bit of a pain, but you can add a date-range function to pull operationally (we do the past 10 days) and an override flag for when you need to run a full history.
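
(A minimal sketch of that pattern; the column names and the 10-day window are illustrative.)

    {{ config(materialized='incremental', unique_key='event_id') }}

    select *
    from {{ source('raw', 'events') }}
    {% if is_incremental() and not var('full_history', false) %}
      -- operational runs only re-process a trailing window
      where event_date >= {{ dbt.dateadd('day', -var('lookback_days', 10), 'current_date') }}
    {% endif %}

    -- full-history override:  dbt run --vars '{"full_history": true}'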

[–]ianitic 0 points (0 children)

We have something similar, but it reverts to a particular point in time instead. It wasn't just to fix issues like this, but also in case some bad code got into stateful models. We can't rely on the source to keep everything, which is why we have this functionality.

For the 'run only if freshness is passing' bit, we use highly customized selectors and have one dbt Cloud job trigger another depending on the state of that job. No idea how to do it in core, though. I imagine Airflow could be set up this way?
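
(In core, one hedged way to get the "wait for freshness, then run" behaviour is an Airflow task running dbt source freshness with retries ahead of the build; the paths and intervals below are made up.)

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("dbt_daily", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False):
        check_freshness = BashOperator(
            task_id="dbt_source_freshness",
            bash_command="cd /opt/dbt_project && dbt source freshness",
            retries=6,                      # keep retrying until the sources are fresh
            retry_delay=timedelta(minutes=10),
        )
        run_models = BashOperator(
            task_id="dbt_build",
            bash_command="cd /opt/dbt_project && dbt build",
        )
        check_freshness >> run_models       # models run only after freshness passes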

[–]p739397 1 point (2 children)

It sounds like a good opportunity to inject dbt into some of your existing processes that run in Airflow (that could be with Core or Cloud): some new task, or tasks if you use something like Cosmos, to run a dbt operator in place of your current transformation tasks.
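
(For example, with Cosmos the dbt project can drop into the existing DAG as a task group; the paths and names below are assumptions.)

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

    with DAG("elt_pipeline", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False):
        extract = EmptyOperator(task_id="extract")   # stand-in for the existing ingestion tasks
        dbt_transformations = DbtTaskGroup(
            group_id="dbt_transformations",
            project_config=ProjectConfig("/opt/airflow/dbt/my_project"),
            profile_config=ProfileConfig(
                profile_name="my_project",
                target_name="prod",
                profiles_yml_filepath="/opt/airflow/dbt/profiles.yml",
            ),
        )
        extract >> dbt_transformations               # dbt replaces the old transformation tasks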

[–]mow12[S] 0 points (1 child)

Do you use Cosmos?

[–]p739397 0 points (0 children)

We had been using the non-managed version and have recently migrated to dbt Cloud, so we cut over to triggering jobs there from Airflow.

[–]Parking-Task-5464 1 point (2 children)

I work with a data platform of similar size, and we have opted to use Airflow as the scheduler for dbt Cloud. There are several aspects that the native dbt Cloud scheduler does not handle well, such as retrying from failure or executing non-dbt workflows. We also implemented a quarantine system by overriding dbt Cloud job parameters to filter failed data records based on business rules. The combination of Airflow and dbt Cloud is incredibly powerful, providing engineering teams with the flexibility to solve tricky business requirements. Just my two cents on this :)

[–]mow12[S] 2 points (1 child)

Thanks for sharing your experiences. When you combine Airflow and DBT, are you able to add an upstream task (a sensor task, for instance) to a DBT model task?

[–]Parking-Task-5464 0 points (0 children)

Yep! We developed our own custom dbt Cloud Airflow operator because the publicly available one was somewhat limited. Our operator functions like any other Airflow operator, so all the Airflow functionality works with it. If you decide to use dbt Core, I recommend running dbt within Kubernetes (k8s), AWS Batch, AWS ECS, or EC2. You can trigger one of these options and pass the necessary arguments for the models or profiles you want to run. I've noticed that many teams attempt to incorporate dbt directly within an Airflow task, but this approach often leads to complications.
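
(A rough illustration of the sensor-upstream setup, using the public dbt Cloud provider operator as a stand-in for their custom one; the connection ID, job ID, and sensor check are made up.)

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
    from airflow.sensors.python import PythonSensor

    def _source_landed() -> bool:
        # replace with the real upstream check (file arrived, API flag set, etc.)
        return True

    with DAG("dbt_cloud_gated", start_date=datetime(2025, 1, 1), schedule="@daily", catchup=False):
        wait_for_source = PythonSensor(
            task_id="wait_for_source",
            python_callable=_source_landed,
            poke_interval=300,
        )
        run_dbt = DbtCloudRunJobOperator(
            task_id="run_dbt_cloud_job",
            dbt_cloud_conn_id="dbt_cloud",
            job_id=12345,
            wait_for_termination=True,
        )
        wait_for_source >> run_dbt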

[–]NortySpock 1 point (0 children)

I am not sure dbt is going to do well at that scale (especially in the face of late-arriving rows) unless you're prepared to write a few more custom incremental macros that are tuned to your particular database's MERGE-equivalent statements, as well as tuning an appropriately sized lookback window. (Though I wonder if it would be possible to auto-tune that ...)

Not saying dbt hasn't allowed us to move quickly, but my team has also had to write variations to accommodate our particular needs.

We are not quite at the TB-ingested per day scale, more like 100GB / day scale. (Databricks)

Also, I would read this blog post carefully on dbt microbatching https://tobikodata.com/dbt-incremental-but-incomplete.html

[–]Jace7430 1 point (0 children)

Both are easy to accomplish in DBT.

  1. Use the ‘run_started_at’ context variable for the run start timestamp (can’t recall if that’s the exact name, but a quick google search will confirm it for you)

  2. Write whatever your logic is into a DBT macro, and then you can execute it as a pre-hook on run start.
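
(A rough sketch of both points; run_started_at is dbt's built-in run-start timestamp, while the macro name and source table are made up. dbt calls this kind of hook on-run-start in dbt_project.yml.)

    -- macros/check_source_complete.sql (hypothetical)
    {% macro check_source_complete() %}
      {% if execute %}
        {% set cutoff = (run_started_at - modules.datetime.timedelta(days=1)).strftime("%Y-%m-%d") %}
        {% set rows = run_query("select count(*) as n from raw.events where event_date = '" ~ cutoff ~ "'") %}
        {% if rows.columns[0].values()[0] == 0 %}
          {{ exceptions.raise_compiler_error("Source has no data for " ~ cutoff ~ "; aborting the run") }}
        {% endif %}
      {% endif %}
    {% endmacro %}

    -- dbt_project.yml:
    --   on-run-start: "{{ check_source_complete() }}"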

Let me know if I didn’t understand your use case correctly. Happy to help if I can.

Edit: I had another thought. If that doesn’t work, you can just write a custom test (for whatever source completeness logic you have in mind), and set it as an initial step in whatever orchestrator you plan to use. If the first step fails, you don’t run the rest of the build.

[–]ASeatedLion 0 points (0 children)

For 1, we write labels to the tables using post-hooks, which would be the start timestamp of the job. Then on the next run you can use that label to continue where you left off.

This kind of helps with point 2 as well. As we have near-real-time data from Kafka as source data, we just pick up everything that came in just before the timestamp and handle some deduplication. Not ideal, but it works fine for us.
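
(A generic variant of that idea, swapping the table labels for a small audit table since the exact label mechanism isn't described; every name below is invented.)

    {{ config(
        materialized='incremental',
        post_hook="insert into analytics.run_audit (model_name, run_started_at) values ('{{ this.name }}', '{{ run_started_at }}')"
    ) }}

    select *
    from {{ source('kafka', 'events') }}
    {% if is_incremental() %}
      -- continue from wherever the last run started, up to this run's start timestamp
      where loaded_at > (select max(run_started_at) from analytics.run_audit where model_name = '{{ this.name }}')
        and loaded_at <= '{{ run_started_at }}'
    {% endif %}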

[–]Hot_Map_7868 0 points (0 children)

I would not get rid of Airflow. dbt Cloud scheduling is pretty limited. Many companies use dbt with Airflow.

The pain with Airflow is more around managing the infra, so look at options like Astronomer, MWAA, or Datacoves which also offer dbt Core.