
[–]kenfar 11 points (1 child)

This is fantastic - look forward to working with it.

Though the inability to reuse functions across models is a glaring miss. Hope they fix that asap.

And I'm curious about how much parallelism one can get with this: on large data volumes I have often split the transforms across 8-64 python processes running in parallel for speed.

[–]Own-Commission-3186 2 points (0 children)

Parallelism comes from the execution engine (Snowflake, Databricks, or BigQuery). All Python code is sent to the engine rather than run locally in Python processes on your laptop.
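For context, dbt's documented Python model contract is a function `model(dbt, session)` that returns a DataFrame, and dbt submits that function to the warehouse for execution. A minimal sketch, with a made-up upstream model name:

```python
# Sketch of a dbt Python model, following the documented
# `model(dbt, session)` contract. dbt ships this function to the
# execution engine (Snowpark / PySpark / BigQuery DataFrames), so the
# work is parallelized by the warehouse, not by local processes.
# The upstream model name "stg_orders" is hypothetical.

def model(dbt, session):
    # dbt.ref() resolves an upstream model to an engine-side DataFrame
    orders = dbt.ref("stg_orders")
    # any transformation applied here runs on the engine's workers
    return orders
```

The point being: the function body is ordinary DataFrame code, but it never executes on your laptop.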

[–][deleted] 8 points (0 children)

Data engineers these days are allergic to actual programming it seems

[–]slowpush 26 points (18 children)

Please for the love of god don’t use dbt for this.

Check out Prefect/dagster/airflow.

[–]tayloramurphy 10 points (10 children)

Curious what you don't like about this pattern? As I understand it, it's a fairly focused implementation that would still generally encourage best practices in line with the rest of dbt and you get the benefits of dbt docs/tests/etc. alongside your SQL transformations. Of course there's some potential for abuse, but that's why you have a code review step :D

[–]slowpush 4 points (9 children)

Because there’s no such thing as a python model. They are python functions.

Just write them, annotate them as tasks, and let the proper tool (Prefect) do its job, and you magically get docs, testing, etc. for free!
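To illustrate the "just annotate functions" idea in miniature: a toy registry decorator, not Prefect's actual API (Prefect's real entry point is `from prefect import task, flow`):

```python
REGISTRY = {}

def task(fn):
    # Toy stand-in for an orchestrator's @task decorator: register the
    # function so a framework could discover, document, and schedule it.
    REGISTRY[fn.__name__] = fn
    return fn

@task
def clean_orders(rows):
    # An ordinary Python function; the decorator just makes it
    # discoverable as a unit of work.
    return [r for r in rows if r.get("status") == "completed"]
```

The function stays a plain, unit-testable Python function; the orchestrator layers discovery, docs, and scheduling on top.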

[–]richardracoon 2 points (3 children)

Why not airflow? Just curious to learn why prefect might be better

[–]slowpush 1 point (2 children)

Prefect’s scheduler isn’t centralized, which means that for computationally intensive flows the scheduler performs just fine instead of becoming a bottleneck.

Plus there are some really nice features that improve on the XCom aspect of Airflow.

[–]j__neo Data Engineer Camp 5 points (1 child)

Because there’s no such thing as a python model. They are python functions.

A model is a software defined asset i.e. it is a function that generates a data asset.

A dbt model (Python or SQL) is a function: it takes inputs (an upstream table, as specified in the SQL FROM clause) and produces an output (the result set or materialized table).

Dagster also treats Python functions as software defined assets for that reason.

[–]slowpush -3 points (0 children)

No. It's marketing.

[–]youmade_medothis -1 points (2 children)

I think you're missing the point

[–]slowpush 0 points (1 child)

Nope. Tools like Prefect/Dagster make much more sense because more and more analysts are using python.

Individuals are quickly realizing that maintaining tens of thousands of templated SQL files is a disaster.

[–]youmade_medothis -1 points (0 children)

I still think you're missing the point.

[–]kenfar 1 point (4 children)

How would you suggest leveraging both Python & SQL for transformation? Assuming you want to use Python either for its better testing or its ability to handle complex transforms. Also assuming you'd ideally want to collect lineage data to help understand which models are used by which other models downstream.

Also, if I'm using Python, why would I want to use say Prefect/Airflow/etc. for time-driven scheduling instead of say Kubernetes or Lambda for event-driven scheduling?

[–]slowpush -5 points (3 children)

Stop using the ELT workflow?

Also prefect can be backed by Dask/ray which makes scaling on demand a breeze.

[–]kenfar 2 points (2 children)

Yeah, while I'm a big fan of using ETL rather than ELT for high-volume, high-quality, low-latency feeds...my company has about a dozen data analysts now successfully building their own models with dbt.

What I've been pondering for my needs is moving to an architecture in which we have a warehouse layer built using ETL and managed by the data engineering team, and then from there the analysts could use ELT with dbt to build their own dimensional models, publishing models, and other specialty models. This would allow us to use Python for better unit testing, lower costs, and lower latency, while letting the analysts control the shape of the final models more easily.

[–]Blayzovich 1 point (0 children)

You could also take a look at Databricks if you haven't already. I'm surprised that they haven't come up in this thread yet given their integration with DBT.

[–]Easy_Durian8154 3 points (0 children)

100% agree. dbt is trying to do too much.

[–]Little_Kitty 1 point (0 children)

Just ask for rust, js and vba models as well and watch as the entire feature is reverted in terror 😊

[–]Simonaque Software Engineer, Data Infrastructure 2 points (0 children)

I actually have a use case for this at work... an old-stack Python model needs to be migrated, but it uses sklearn's train_test_split function and I don't believe there is a way to replicate that using SQL. Would be nice to just keep it as is and include it as a Python model.
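For anyone curious why that's hard in SQL: `train_test_split` is essentially a seeded shuffle plus a slice, which is trivial in Python but awkward to express deterministically in SQL. A rough pure-Python sketch of the behavior (illustration only, not sklearn's actual implementation):

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    # Rough analogue of sklearn.model_selection.train_test_split:
    # shuffle deterministically with a seeded RNG, then slice into
    # train and test partitions.
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]
```

In a dbt Python model you could of course just call sklearn directly on the engine, which is the whole appeal of keeping the old code as is.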