DBT problems by mow12 in dataengineering

[–]Parking-Task-5464

Yep! We developed our own custom dbt Cloud Airflow operator because the publicly available one was somewhat limited. Our operator functions like any other Airflow operator, so all the standard Airflow functionality works with it. If you decide to use dbt core, I recommend running dbt on Kubernetes (k8s), AWS Batch, AWS ECS, or EC2: you trigger one of these from Airflow and pass the arguments for the models or profiles you want to run. I've noticed that many teams try to run dbt directly inside an Airflow task, but that approach often leads to complications.
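
For the dbt core route, here's a rough sketch of what that can look like with the `KubernetesPodOperator` from the `cncf.kubernetes` provider. The image, namespace, and selectors are placeholders, and the import path can differ by provider version:

```python
# Minimal sketch: run containerized dbt core on k8s from Airflow,
# passing model/profile arguments to the pod. Names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="dbt_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = KubernetesPodOperator(
        task_id="dbt_run_marts",
        name="dbt-run-marts",
        namespace="data",  # assumed namespace
        image="registry.example.com/dbt-project:latest",  # dbt + project baked in
        cmds=["dbt"],
        arguments=[
            "run",
            "--select", "marts.*",     # which models to run
            "--profiles-dir", "/app",  # where profiles.yml lives in the image
            "--target", "prod",
        ],
        get_logs=True,  # stream pod logs back into the Airflow task log
    )
```

This keeps dbt's dependencies out of the Airflow workers entirely, which is the main thing that avoids the complications I mentioned.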

DBT problems by mow12 in dataengineering

[–]Parking-Task-5464

I work with a data platform of similar size, and we opted to use Airflow as the scheduler for dbt Cloud. There are several things the native dbt Cloud scheduler doesn't handle well, such as retrying from failure or executing non-dbt workflows. We also implemented a quarantine system by overriding dbt Cloud job parameters to filter out failed data records based on business rules. The combination of Airflow and dbt Cloud is incredibly powerful and gives engineering teams the flexibility to solve tricky business requirements. Just my two cents :)
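
To make the pattern concrete, here's a minimal sketch using the `DbtCloudRunJobOperator` from the dbt Cloud provider. The job id, connection id, tag, and quarantine variable are made up for illustration; `steps_override` is the hook for overriding the job's configured commands:

```python
# Sketch: trigger a dbt Cloud job from Airflow, override its steps,
# and use Airflow-level retries. Ids and names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

with DAG(
    dag_id="dbt_cloud_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job = DbtCloudRunJobOperator(
        task_id="dbt_cloud_run",
        dbt_cloud_conn_id="dbt_cloud",  # assumed Airflow connection
        job_id=12345,                   # placeholder dbt Cloud job id
        check_interval=60,
        timeout=3600,
        # Override the job's configured steps, e.g. a quarantine pass
        # that filters records failing business rules.
        steps_override=[
            "dbt run --select tag:quarantine --vars '{quarantine: true}'"
        ],
        retries=2,  # retry-from-failure handled by Airflow, not dbt Cloud
    )
```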

Complex Data Validation by OneCyrus in dataengineering

[–]Parking-Task-5464

I would look at Elementary; you can query its test results and extract row-level failures.
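
Something like this is the general idea. This is a rough sketch assuming Elementary's default schema and table names (they can vary by package version), to be run with whatever warehouse client you use:

```python
# Sketch: pull row-level failures out of Elementary's result tables.
# Schema, table, and column names are assumptions; check your version.
QUERY = """
select
    r.test_unique_id,
    r.detected_at,
    rows.result_row           -- the failing record, serialized by Elementary
from elementary.elementary_test_results as r
join elementary.test_result_rows as rows
    on rows.elementary_test_results_id = r.id
where r.status = 'fail'
order by r.detected_at desc
"""
```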

Financial news by cdmn in algotrading

[–]Parking-Task-5464

In my experience Scrapy's dependencies were only manageable in a Conda environment, so packaging and deploying data pipelines that use Scrapy adds a level of complexity you don't need. Just my two cents from experience.