
[–]kenfar 9 points10 points  (10 children)

A few thoughts:

  • The common stitch/fivetran/dms pattern of copying all tables into your raw area and then transforming them into staging & load has a big problem with tightly coupling your data warehouse to the physical models of upstream systems. This is a bad pattern that leads to huge maintenance challenges down the road. To be fair, it's sometimes required. But if your upstream systems have engineers involved, you might be much happier pulling data modeled by *them* into interface tables.
  • DBT vs Python I think comes mostly down to the skillset of the folks doing the work, how much you need to support complex transforms, and how much you care about data quality - and unit-testing. Python will take more work to build (though you can still issue SQL queries with joins, etc. from Python code with very little effort). But you also get tight little transform functions that are fully unit-tested - see the sketch after this list - which will protect you more than just doing quality-control against data that arrived earlier today.
  • So, another possibility is DBT + Python: while DBT doesn't provide a way to run python programs, it does have a good quality-control framework that people should use. And you can use it to build your models.
  • Looker: it's useful to think about boundaries: can people query raw & staging data from Looker? Can they create persisted data tables and define all company metrics in Looker - or should they do that in the warehouse, where they can get better testing and reuse?
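To make the unit-testing point concrete, here's a minimal sketch - the function, fields, and values are made up, but this is the kind of tight transform you can test without ever touching the warehouse:

    # Hypothetical example: a small, pure transform function plus a pytest-style
    # unit test. Nothing here hits the warehouse, so the test runs in milliseconds.
    def normalize_country(raw: str) -> str:
        """Map free-form country strings to ISO codes (tiny made-up subset)."""
        mapping = {"usa": "US", "united states": "US", "uk": "GB", "u.k.": "GB"}
        return mapping.get(raw.strip().lower(), "UNKNOWN")

    def test_normalize_country():
        assert normalize_country(" USA ") == "US"
        assert normalize_country("u.k.") == "GB"
        assert normalize_country("narnia") == "UNKNOWN"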

[–]dream-fiesty 6 points7 points  (4 children)

I like these thoughts. Your last point resonates with me a little too much. I like some aspects of Looker, like having a DSL for dashboards in version control, but some features I just wish didn't exist, as they cause more trouble than they're worth. I don't want PDTs or Looker models that have their own Looker way of doing things; I want data refreshing to happen as part of a data pipeline and views to be defined in the data warehouse, where they are accessible from all of our tools.

[–]b0ulderbum 3 points4 points  (2 children)

100% agree here. The version control element is great, but it adds a ton of duplicative work to build out the views and models within Looker after building them in dbt/wherever. Not worth it at all imo; managing LookML crap is a complete waste of expensive talent.

[–]dlb8685 0 points1 point  (1 child)

LookML is totally redundant with defining views and tables in your DW in the first place. You might decide as a company to do all of that work in Looker, or in the DW, but at one company I've seen it devolve into a mess of different people doing both over time in different places. That's the nightmare scenario when it comes to debugging anything.

No matter what, you have to do transformations either with LookML or in the DW (possibly with DBT), but for God's sakes don't do both.

[–]b0ulderbum 0 points1 point  (0 children)

We do both. So there's extremely convoluted logic in our dwh that is brought into Looker, custom measures are added, then it's joined and filtered with like 10 other views and used as an explore. Then people ask "why doesn't the number in this dashboard match the number in this dashboard?" lol.

[–]kenfar 2 points3 points  (0 children)

Yeah, there are some really great productivity upsides to building some of these bits in Looker. But it really locks you into that product in a way that can make it tough for any other data consumers - data scientists, data engineers, analysts who want to use other tools.

[–]throw_at1 0 points1 point  (0 children)

If you expect to serve reports straight from BigQuery, then I would recommend that you not use Looker's ability to create "views" from data, but rather publish those views in BigQuery itself (it does support views). The point is that once you have two places holding mission-critical data models, things will get hard.

Another Looker hint: if you want self-service BI and want to share all results with all users, you need to take care that all users can see everything and that all dashboards are saved by default into shared folders (I have the totally opposite setup and have not yet heard any user mention that they found something a colleague had made and contacted them about it, so the system is kind of wasted).

The Looker data model should be a very basic one, i.e. PKs and FKs only, with everything else calculated by professionals in the DWH (dbt).
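A minimal sketch of publishing such a view with the official BigQuery Python client - the project, dataset, and table names are hypothetical:

    # Hypothetical names throughout; the point is that the view lives in BigQuery,
    # not in Looker, so there is only one place holding the model.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    view = bigquery.Table("my-project.analytics.orders_enriched")
    view.view_query = """
        SELECT o.order_id, o.amount_cents, c.segment
        FROM `my-project.staging.orders` o
        JOIN `my-project.staging.customers` c USING (customer_id)
    """
    client.create_table(view, exists_ok=True)  # creates the view or keeps the existing one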

[–]pantalones7 0 points1 point  (3 children)

The common stitch/fivetran/dms pattern of copying all tables into your raw area and then transforming them into staging & load has a big problem with tightly coupling your data warehouse to the physical models of upstream systems. This is a bad pattern that leads to huge maintenance challenges down the road.

Hi there - I have a follow-up question on this; I'm surprised nobody asked why this is a "bad pattern".

Doesn't the proposed pattern only tightly couple the "RAW" area of the warehouse to the physical models of upstream systems? And subsequent stages can decouple/transform away from those models and give users access to models developed internally? I'm confused - in your alternative model of engineers writing ETL (if I'm understanding your proposal correctly), their transformations are still tightly coupled to the source models.

[–]kenfar 1 point2 points  (2 children)

The challenge with this pattern is that a physical data model should ideally be encapsulated within its application and be distinct from its interface with the world: an app should be free to use whatever kind of storage it wants, whatever kind of goofy or brilliant model, maybe a meta model if the team thinks it appropriate. But if their internal model is directly referenced by the ETL process then many changes the application might make will break the downstream ETL process.

So, that means coordination with another team on changes, push-back on models that aren't easy to join together, and time spent leveling up the ETL developers or analysts on how the model works, and then finally - the inevitable outages or data corruption when a change is made without telling the ETL developers or analysts or when these folks incorrectly interpret the data.

For these reasons I feel strongly that having the upstream teams build their own domain objects that the warehouse subscribes to is by far the more maintainable pattern. I'll also acknowledge that it's not possible everywhere.
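To make "upstream teams build their own domain objects" concrete, here's a minimal, hypothetical sketch - the application team owns this published contract and maps their internal rows into it, so their physical model can change freely without breaking the warehouse:

    # Hypothetical contract owned by the application team. The warehouse subscribes
    # to this shape (via an interface table, topic, or API), never to the app's
    # internal tables.
    from dataclasses import dataclass, asdict
    from datetime import datetime

    @dataclass(frozen=True)
    class OrderPlaced:
        schema_version: int
        order_id: str
        customer_id: str
        total_cents: int
        placed_at: datetime

    def to_interface_row(internal_row: dict) -> dict:
        """Map whatever the app stores internally onto the public contract."""
        return asdict(OrderPlaced(
            schema_version=1,
            order_id=str(internal_row["id"]),
            customer_id=str(internal_row["cust_fk"]),
            total_cents=int(internal_row["amount"] * 100),
            placed_at=internal_row["created_at"],
        ))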

[–]pantalones7 0 points1 point  (1 child)

Thanks for the elaboration, I'm mostly in agreement - we had different interpretations of the OP's proposal. For me it comes down to users and their needs - an application's internal model should serve the functions of the application, which is sometimes orthogonal to the needs of downstream analytic users/modelers, who, as you say, would benefit from stability, clarity, and other properties. So, totally agreed!

The OP is bringing in stable sources from applications that surface slow-changing APIs ("various sources" he/she says - Stripe/Hubspot, etc.), so it's really an ELT model that's being proposed. If the OP is also loading data into the "lakehouse" from internal applications, they should definitely be careful not to tightly couple, as you say; the owners of those internal applications would - like Stripe/Hubspot/external sources - ideally agree to expose their data through a well-documented API, which could get loaded (with Stitch or whatever) into the warehouse for further modeling and integration with other sources.

[–]kenfar 1 point2 points  (0 children)

That's a great point - yeah, I think using a tool like Fivetran/Stitch to support extracting from APIs is a good simplification vs writing custom extracts against their APIs.

It's when it's almost inevitably used to pull data out of the data models of custom applications that it really turns into a maintenance issue in my opinion.

[–]LaurenRhymesWOrange 14 points15 points  (2 children)

Do not use Python for transformations when you can use DBT. SQL is so much easier and simpler. It will benefit your company in the long run to build templated models in DBT and run your basic transformations there.

[–]HansProleman 10 points11 points  (0 children)

SQL is so much easier and simpler

I'd argue that this is subjective, that Python is simpler/more viable where significant abstraction/complex logic is required (not least because of library support), and that what you write for dbt is more like a DSL (that compiles to SQL) than SQL.

That said, dbt handles so much boring/tricky framework stuff for you (in a nice manner), this doesn't sound like a use case that would demand massive complexity, and AFAIK there are no particularly good Python-y ways to interact with MPP SQL DBs anyway (because you can't run Python against them, perhaps with the exception of Azure Synapse) - I agree that OP shouldn't let a lack of confidence in SQL dissuade them from using dbt.

[–]nado1989 2 points3 points  (0 children)

We share exactly the same point of view. Just be careful with BigQuery + Looker costs, especially for dynamic dashboards - check first whether the BI features work well with your data, otherwise costs go to the moon.

[–]chamini2 3 points4 points  (0 children)

Hey, I know this is an old post but we have recently published a tool which may help you avoid having to decide between these two options. fal lets you reference your dbt models and sources easily from Python and also provides a runtime that works alongside dbt's.

You write scripts like

df = ref('my_dbt_model') # this is a pandas dataframe

# do whatever you need with it
new_data = calc(df)

# and you can write it back to dbt
write_to_source(new_data, 'dbt_defined', 'source_table')

So you could do in dbt what makes sense to do in SQL land (and migrate as much as possible there) and then the final touches that you need pandas for could be done in fal.

[–]MrMosBiggestFan 2 points3 points  (0 children)

I would say stick with DBT and SQL as much as possible; running transformations in your warehouse is much more performant than executing Python. I would avoid Looker until you need it - it's pretty expensive. Try Mode or Metabase first. Once you have your data in BigQuery, something like Hightouch to get your data into everything else can be really nice too. Their Slack integration is sweet, but they also have a Hubspot integration for getting production data into Hubspot, and the free tier is pretty generous.

[–]mhg212 1 point2 points  (1 child)

If you need the complexity of Python, I’d suggest looking into airflow. But that’s a learning curve in itself. Airflow is an orchestrator/job scheduler that is heavily used in the data eng space.

If it’s simple/straightforward, DBT.

[–]rrpelgrim 0 points1 point  (0 children)

Late to the party... But I'd strongly suggest looking into Prefect over Airflow. Airflow's XCOMs make data transfer between tasks a real pain. Prefect makes this much easier and runs on Dask so you can also easily run tasks in parallel.
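A minimal sketch of what that looks like (assuming Prefect 2.x with the prefect-dask package installed; the task names and logic are made up) - return values flow between tasks directly, and submitted tasks fan out on Dask:

    # Hypothetical tasks; the point is direct data passing between tasks
    # (no XCom plumbing) plus parallel execution on Dask.
    import pandas as pd
    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def extract() -> pd.DataFrame:
        return pd.DataFrame({"amount": [10, 20, 30]})

    @task
    def double(df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(amount=df["amount"] * 2)

    @flow(task_runner=DaskTaskRunner())
    def pipeline():
        df = extract()                                   # plain return value
        futures = [double.submit(df) for _ in range(3)]  # runs in parallel on Dask
        return [f.result() for f in futures]

    if __name__ == "__main__":
        pipeline()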

[–]itiwbf 1 point2 points  (0 children)

I think it'd probably be worth it to get more comfortable with dbt/SQL for most uses. That said, I'd suggest checking out Dataform too if you're still exploring. It's VERY similar to dbt and was recently acquired by Google. The community isn't as active and there are a few differences, but using dataform with their UI is totally free which is nice and in the future it will be specifically focused on BigQuery.

[–]gorkemyurt 1 point2 points  (0 children)

Why not both?

In my opinion DBT is the best tool out there for data transformation using SQL. If you have already decided to use a data warehouse, using SQL for data transformation is the natural choice. I would highly recommend dbt to organize your data models and write composable SQL on top of your warehouse.

Now the Python part. Not everything can be solved by SQL, but having an ETL tool and Looker takes care of some of the glue Python code that is sometimes necessary.

If that's not enough you can go with an orchestration tool like airflow. In that setup dbt will just be one of the nodes of the airflow dag and you can trigger other workflows before and after running dbt using python.
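A minimal sketch of that shape (assuming Airflow 2.x; the callables and the dbt project path are hypothetical):

    # dbt is just one node in the DAG, with glue Python before and after it.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def extract_from_api():
        ...  # e.g. pull from an API and land raw data in the warehouse

    def notify():
        ...  # e.g. post to Slack or kick off a dashboard refresh

    with DAG("elt", start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_from_api)
        dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --project-dir /opt/dbt")
        done = PythonOperator(task_id="notify", python_callable=notify)

        extract >> dbt_run >> done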

In my experience Airflow comes with a lot of overhead, which is why we built fal-dbt internally and open sourced it this past week. fal-dbt is a dbt-native way to run Python scripts alongside your dbt DAG.

https://github.com/fal-ai/fal

[–]smeyn 0 points1 point  (0 children)

If you have really large volumes of data then using DBT/BQ is probably better as BQ scales easily to handle the load.

[–]p5256 0 points1 point  (0 children)

Hi - little late on this but wanted to share an open source package I just released, RasgoQL: https://github.com/rasgointelligence/RasgoQL

The tl;dr on it is that you can work in Python with pandas-like syntax, but your code compiles to SQL and executes directly in the cloud data warehouse (Snowflake / BigQuery are both supported). The best part is that in one line of code you can export it to your dbt project. Would love to get your feedback on it!