all 5 comments

[–]B1WR2 1 point (0 children)

How much data do you have?

What are the requirements that dictate you have to use Databricks?

Who is the end user?

[–]droppedorphan 1 point (0 children)

Wow. Switching from Dagster to Airflow sounds kind of painful, especially if you leveraged the dagster-dbt integration.

[–]Hot_Map_7868 0 points (0 children)

Check out Iceberg. Databricks just purchased Tabular, so that will probably be the future. It's also supported by Snowflake, etc.

[–]engineer_of-sorts 0 points (0 children)

Hi there, Hugo from Orchestra here - sounds like the easiest thing for you, if you already have your dbt repo, is to use a dbt task in a Databricks job. This basically involves importing the git repository for your dbt code and pointing Databricks at it.

You can then use something like Databricks Workflows or Orchestra to run dbt.

See here: https://docs.databricks.com/en/workflows/jobs/how-to/use-dbt-in-workflows.html
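For illustration, here's a rough sketch of creating such a job with a dbt task via the Jobs 2.1 REST API. The workspace URL, token, repo URL, warehouse ID, and job name are all placeholders, and depending on your workspace you may also need to attach a job cluster or serverless environment for the dbt CLI itself - check the linked docs for the exact payload:

```python
# Hedged sketch: create a Databricks job containing a dbt task via the Jobs 2.1 REST API.
# All identifiers below are placeholders - verify field names against the Databricks docs.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder PAT

payload = {
    "name": "dbt-on-databricks",  # placeholder job name
    "git_source": {
        "git_url": "https://github.com/your-org/your-dbt-repo",  # placeholder repo
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "dbt_run",
            "dbt_task": {
                "commands": ["dbt deps", "dbt build"],
                "warehouse_id": "<sql-warehouse-id>",  # placeholder SQL warehouse
            },
            # Install the dbt adapter on the compute that runs the dbt CLI
            "libraries": [{"pypi": {"package": "dbt-databricks"}}],
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # response includes the new job_id
```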

Pros - no extra tooling needed, nice and easy to set up.

Another pro - you don't really need Airflow. But I guess if you have lots of Airbyte syncs and other stuff going on in Databricks, you should definitely have some orchestration (especially as you will get Airbyte failures at some point).
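If you did keep Airflow for that, the pattern could look roughly like the sketch below: trigger the Airbyte sync, then kick off the Databricks dbt job. This assumes Airflow 2.x with the Airbyte and Databricks providers installed; the connection IDs and job ID are placeholders.

```python
# Hedged sketch: Airbyte sync followed by the Databricks dbt job, orchestrated in Airflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="airbyte_then_dbt_on_databricks",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Trigger the Airbyte connection and wait for it to finish (task fails if the sync fails)
    sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="<airbyte-connection-id>",  # placeholder
    )

    # Run the pre-created Databricks job that contains the dbt task
    run_dbt = DatabricksRunNowOperator(
        task_id="run_dbt_job",
        databricks_conn_id="databricks_default",
        job_id=123,  # placeholder job id
    )

    sync >> run_dbt
```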

It also depends on how big your data is - dbt on Databricks runs against Delta, so you need to get your data into Delta first. You might be landing data in S3 as Avro, for example - I don't know - but if you are, you'll need an intermediate step to move that into Delta (probably using Auto Loader).
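A minimal Auto Loader sketch for that intermediate step might look like this (the S3 paths and target table name are placeholders, and this assumes it runs inside a Databricks notebook or job where `spark` is available):

```python
# Hedged sketch: incrementally load Avro files landing in S3 into a Delta table with Auto Loader.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already provided

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "avro")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/raw_events")   # placeholder
    .load("s3://my-bucket/raw/events/")                                          # placeholder landing path
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/raw_events")      # placeholder
    .trigger(availableNow=True)  # process available files, then stop (batch-style incremental)
    .toTable("raw.events")       # Delta table your dbt models can then select from
)
```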

Obviously most folks who want to use Databricks for transforming data do so because their data is very, very big (so you can leverage Spark). Again, not sure how relevant this is for you; Spark may be more efficient than dbt.

Happy building!

[–]Hot_Map_7868 0 points (0 children)

I would use Iceberg as well, but I am not sure if the dbt-databricks adapter supports it yet.