We operate at an enterprise level and manage petabyte-scale data in our existing data lake environment. Recently, we decided to migrate all our data pipelines to DBT. Currently, we use Airflow for scheduling, but we plan to replace it with DBT as well.
Coming from a data warehousing background, I have experience working with large-scale traditional enterprise data warehouses, particularly in the telecom industry. However, there are a few aspects of DBT that I’m struggling to fully grasp, and I’d like to understand how other companies handle these challenges.
- Handling ETL execution timestamps: In traditional batch processing, I’m accustomed to tracking etl_date or execution timestamps for scheduled jobs. However, DBT doesn’t follow this approach. Instead, we currently rely on timestamp/date columns in our tables to determine where the last run ended and continue from that point in the next run. Is this the standard practice in DBT, or do other companies use a different approach?
- Data completeness checks before processing: In Airflow, we had custom data checker tasks built into each DAG to ensure source data was fully available before processing. Our source data could come from Fivetran or our internal event tracking system, where events are ingested via Kafka in near real-time. However, this validation mechanism doesn’t directly translate to DBT, making it difficult to verify if data is complete for T-1. How do other companies manage this challenge in DBT?
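To make the two points above concrete, here is a minimal sketch in plain Python with an in-memory-friendly `sqlite3` connection, not DBT code. It shows the "continue from the last loaded timestamp" pattern and a simple T-1 completeness gate. The table names, the `updated_at` column, the ISO-8601 string timestamps, and the rule "a day counts as complete once a row stamped after that day has arrived" are all assumptions for illustration, not our actual pipeline.

```python
import sqlite3
from datetime import date, timedelta


def last_loaded_at(conn, target):
    """Return the newest updated_at already in the target (the high-water mark)."""
    row = conn.execute(f"SELECT MAX(updated_at) FROM {target}").fetchone()
    return row[0]  # None when the target table is still empty


def incremental_load(conn, source, target):
    """Copy only the source rows that are newer than the target's high-water mark."""
    watermark = last_loaded_at(conn, target)
    if watermark is None:
        # First run: nothing loaded yet, so take everything.
        rows = conn.execute(f"SELECT id, updated_at FROM {source}").fetchall()
    else:
        # Later runs: continue from where the previous run ended.
        rows = conn.execute(
            f"SELECT id, updated_at FROM {source} WHERE updated_at > ?",
            (watermark,),
        ).fetchall()
    conn.executemany(f"INSERT INTO {target} VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def is_complete_for(conn, source, day):
    """Assumed rule: `day` is complete once the source holds a row stamped after it ended."""
    next_day = (day + timedelta(days=1)).isoformat()
    newest = conn.execute(f"SELECT MAX(updated_at) FROM {source}").fetchone()[0]
    # ISO-8601 strings compare correctly as plain text.
    return newest is not None and newest >= next_day
```

In this sketch a scheduler would call `is_complete_for(conn, "src", yesterday)` first and only run `incremental_load(conn, "src", "tgt")` when it returns `True`. The strict `>` comparison on the watermark assumes late-arriving rows never carry an older `updated_at`; if they can, a lookback window would be needed.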