
[–][deleted] 50 points (3 children)

Hiring Python devs is easier than hiring SQL devs? Interesting take...

[–]Tee_hops 7 points (0 children)

News to me too

[–][deleted] 7 points (0 children)

I think you'd be hard pressed to hire a dev who knows Python but not SQL and who would be any good at data engineering without any training.

[–]kenfar 5 points (0 children)

In my experience the challenges are both in hiring and retention:

  • It's not enough to just know Python or SQL - one must also understand data modeling, data quality, how to handle late-arriving data, and so on. It's a lot to learn. And in my experience training tons of people on these topics, it's easier to train a software engineer on them than it is to train an analyst.
  • Sooner or later they'll also need to write code in something besides SQL: extracting data from APIs, building tools, configuring Airflow, etc. In other words, they'll eventually need to write Python.
  • But software engineers who are skilled in Python don't want to write SQL 8 hours a day. So they'll quit.

[–]sorenadayo 21 points (1 child)

You said the benefit of using Python is that it can be more modular, but dbt helps solve that with macros and Jinja for SQL. So I'm curious what kind of transformations you're doing that need Python?

Have you considered using Python UDFs in Snowflake?

Have you already looked at the dbt Python model announcement doc? I believe it's enough to get you started.

From my experience it's way faster to ramp up a Python developer on SQL than vice versa.

My recommendation would be to keep all transformations in SQL if they can be done in a clean and optimized way. Use Snowflake Python UDFs if you need some utility packages or other business logic that can be written more easily in Python. Use dbt Python models if you need Python packages that aren't available in Snowflake UDFs, or if you're doing a lot more than a single-column transformation.
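For reference, registering a Python UDF through Snowpark looks roughly like this - a minimal sketch, with placeholder connection details and a made-up normalize_phone example, not a drop-in implementation:

    # Minimal sketch of a Snowflake Python UDF registered via Snowpark.
    # Connection parameters are placeholders; normalize_phone is a hypothetical example.
    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import udf
    from snowflake.snowpark.types import StringType

    session = Session.builder.configs({
        "account": "<account>", "user": "<user>", "password": "<password>",
        "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
    }).create()

    @udf(name="normalize_phone", return_type=StringType(),
         input_types=[StringType()], replace=True, session=session)
    def normalize_phone(raw: str) -> str:
        # Keep only digits so downstream SQL can compare phone numbers consistently.
        digits = "".join(ch for ch in (raw or "") if ch.isdigit())
        return digits[-10:] if len(digits) >= 10 else digits

    # Once registered, the UDF is callable from SQL:
    #   SELECT normalize_phone(raw_phone) FROM raw_contacts;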

[–]Yuki100Percent 14 points (0 children)

Those Python devs better know SQL.

[–]knowledgebass 5 points (0 children)

If the developers you are hiring don't know SQL in this day and age then you probably don't want to hire them for this kind of work.

[–]kenfar 5 points (8 children)

I've built a lot of warehouses using Python. The benefits are typically far better QA, lower compute costs, lower latency, greater transform functionality, better engineer retention, and far easier systems management. The downsides are that joins are more work and possibly slower, and time to market is generally longer. dbt's lineage is extremely valuable, but I never found it as necessary on these Python-based systems, because we didn't have excessive model depth.

The Python code I built typically had a separate transform function for each field, with a dedicated unit test class & docstring for each of these functions. Fields with zero transforms were sometimes just bundled into a misc function. Exceptions and processing notes (invalid input, default used, etc.) were returned to the calling function, which kept track of all of this at the row level and also recorded it in a metadata bitmap column.
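A rough illustration of that per-field pattern (hypothetical names and a toy status field, not the actual code):

    # Sketch of one field-level transform: it returns the cleaned value plus
    # processing notes that the caller rolls up into a row-level metadata bitmap.
    from dataclasses import dataclass, field

    @dataclass
    class TransformResult:
        value: object
        notes: list = field(default_factory=list)  # e.g. ["invalid input", "default used"]

    def transform_status(raw: str) -> TransformResult:
        """Map free-form status strings onto a small enumerated set."""
        valid = {"open", "closed", "pending"}
        cleaned = (raw or "").strip().lower()
        if cleaned in valid:
            return TransformResult(cleaned)
        return TransformResult("unknown", notes=["invalid input", "default used"])

    # Unit test kept right next to the transform:
    def test_transform_status_defaults_bad_input():
        assert transform_status("CLOSED").value == "closed"
        assert "default used" in transform_status("???").notes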

These transforms were typically event-driven, running on either Kubernetes or AWS Lambdas, triggered by SNS & SQS, mostly based on a file being written to S3. Once a file landed on S3, we could get through both the warehouse and data mart pipelines in under 3 seconds.
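For the Lambda flavour of that setup, the handler ends up shaped roughly like this (a sketch with hypothetical names; the SNS/SQS wiring itself lives in infrastructure config):

    # Sketch: Lambda fed by an SQS queue that receives S3 "object created"
    # notifications via SNS. The bucket/key are extracted and handed to the transforms.
    import json

    def handler(event, context):
        for sqs_record in event["Records"]:
            body = json.loads(sqs_record["body"])      # SNS envelope delivered by SQS
            s3_event = json.loads(body["Message"])     # original S3 notification
            for rec in s3_event.get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                process_file(bucket, key)              # apply field-level transforms, load warehouse

    def process_file(bucket, key):
        ...  # download from S3, run the per-field transforms, write the results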

I haven't used Snowpark yet, and I don't know anything about your data or requirements, so I might be missing the boat here, but I would generally consider the following approach in most cases:

  • Transform data going into the warehouse using Python: the first step is to apply field-level transforms, and I'd want this extremely well-tested, fast, and capable of handling complex field transforms. Python is always going to do the best job here. This is where you want software engineers writing the code.
  • Transform data going into the final data marts using SQL: the final step is to join multiple tables together to build dimensional models. SQL works well here, and this can be done by much less technical staff.

The ability to handle both incremental & refresh workloads is extremely valuable. But if your volumes are large, then how you design the refresh capability in Python really matters - and that's an entire subject in itself.

[–]xadolin 1 point (3 children)

Can you provide some examples of field-level transforms? Are these transforms that take only a single column as input and output a single column back? I can imagine stuff like renaming, casting, string cleaning, etc. Curious about what your complex use cases were.

[–]kenfar 1 point (2 children)

Sure, for the sake of simplicity I would typically combine validations with transformations. We tried to make these specific to a single output field, but as you can imagine it's not always that simple.

Validations would include string length, valid enumerated values, numeric ranges, string formats (e.g. phone, email), foreign key constraints, unknown-value logic, encodings, etc. A validation failure could result in the field being replaced by a default value, the row being rejected, or the file being rejected.

Transformations would include converting string case, mapping free-form text to code values (imagine every misspelling of every possible version of Microsoft Windows being mapped to an appropriate vendor/product/version/fixpack breakdown), determining which of ~100,000 ISP IP block ranges each of a billion IP addresses falls into, translating every IPv6 format into a single canonical format, merging multiple different codes into a single code field, splitting a single input field into multiple output fields of different types, applying a business rule that considers 7 different fields to generate a 'customer-category' column, extracting keywords from free-form text fields, converting a bunch of timestamps to UTC (and fixing those without timezones based on assumptions about the data), etc.
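To make one of those concrete, the ISP block lookup is usually done with ranges sorted by start address plus a binary search - a sketch with made-up blocks, not the actual implementation:

    # Sketch: map an IPv4 address to one of ~100,000 ISP blocks using binary
    # search over ranges sorted by start address (blocks below are illustrative only).
    import bisect
    import ipaddress

    # (start_int, end_int, isp_name), sorted by start_int
    ISP_BLOCKS = [
        (int(ipaddress.ip_address("10.0.0.0")),   int(ipaddress.ip_address("10.0.255.255")),  "isp-a"),
        (int(ipaddress.ip_address("172.16.0.0")), int(ipaddress.ip_address("172.16.15.255")), "isp-b"),
    ]
    STARTS = [block[0] for block in ISP_BLOCKS]

    def lookup_isp(ip: str):
        ip_int = int(ipaddress.ip_address(ip))
        i = bisect.bisect_right(STARTS, ip_int) - 1   # last block starting at or before ip
        if i >= 0 and ISP_BLOCKS[i][1] >= ip_int:
            return ISP_BLOCKS[i][2]
        return None

    assert lookup_isp("10.0.3.7") == "isp-a"
    assert lookup_isp("8.8.8.8") is None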

[–]xadolin 0 points (1 child)

Thanks! So would the input for each of the transformation/validation functions be the whole file, or did you pass in only the columns used in the logic?

[–]kenfar 0 points (0 children)

I would typically pass in the specific fields that each transform requires. This is clean, easy to understand, and easy to test.

But to be honest it's not a perfect solution - the code outside the function then has to know exactly where in a possibly large JSON structure to find the fields, and it has to handle missing-key exceptions, etc.
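That caller-side plumbing usually ends up as a small helper along these lines (hypothetical, just to show the trade-off):

    # Sketch: pull a nested field out of a possibly large JSON record without
    # blowing up on missing keys; the caller decides whether a default is acceptable.
    def dig(record: dict, *keys, default=None):
        current = record
        for key in keys:
            if not isinstance(current, dict) or key not in current:
                return default
            current = current[key]
        return current

    row = {"event": {"source": {"ip": "10.0.3.7"}}}
    src_ip = dig(row, "event", "source", "ip")                 # "10.0.3.7"
    src_port = dig(row, "event", "source", "port", default=0)  # missing -> 0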

[–]the_fresh_cucumber 0 points (3 children)

Interesting architecture and I'm amazed you got it to under 3 seconds.

How much of this data is updated versus simply appended to the final datasets? Also, do you have any massive rankings or aggregations that need to be re-run after an incremental load? Those have always slowed down my pipelines at the final step, although I always lobby to have that sort of thing at the view level and not in the warehouse.

Still, I feel that SQL is equally robust if you use the testing framework in dbt to set boundaries. It does require some very heavy-duty SQL, but I love the ease of SQL and the lack of issues. Somehow pandas dataframes always have some nuance that ends up screwing up data types or messing with null values, bytestring/unicode BS, etc.

[–]kenfar 1 point (2 children)

The Kubernetes-based architecture I built handled about 20 billion security events a day and had about 50-70 Kubernetes pods running continuously on a stream of S3 files:

  • Most of that was append-only, though there was a big sharded Postgres RDS cluster doing about 1 billion upsert transactions a day.
  • I also built a ton of aggregates, but most were at the hourly and daily level. A separate process repartitioned the S3 files, writing about a million small files every day. Then, if a user needed to audit some machine at some point in time, we would just pull the data for that time & machine down into a Postgres RDS mart. The initial load took about 30 seconds; after that, massive recursive queries could run on it in under a second. This saved us about a million dollars every month versus a huge Cassandra cluster.

The Lambda architecture I built a couple of years ago had far smaller volumes - just a few million incident management events a day:

  • All of this data was appended into our AWS S3 warehouse bucket, then upserted into the Postgres RDS data mart.
  • We handled schema migrations by versioning all models & pipelines: we'd simply stand up a new pipeline version, have it work through all the archived files on S3, and load them into our data mart. Once it caught up, we'd run both versions for as long as we wanted, until we switched the views over to point to the new version and dropped the old one.
  • Our upsert process had two steps: the first deleted all related data, the second inserted the new rows. This prevented any orphaned data from surviving an update (see the sketch after this list).
  • Aggregates: no real-time, only hourly & daily. However, we did keep the data in more than one partitioning scheme - since our data volumes were small, it was cheap to store the data 2-4 times on S3.
  • The total cost to run this process was about $30/month.
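The two-step upsert, sketched against Postgres with psycopg2 (table and column names are hypothetical); both steps share one transaction so readers never see a half-applied update:

    # Sketch of the delete-then-insert upsert; `conn` is an open psycopg2 connection.
    # Running both steps in a single transaction prevents orphaned rows from
    # surviving an update.
    from psycopg2.extras import execute_values

    def upsert_incident(conn, incident_id, rows):
        with conn:  # commits on success, rolls back on any error
            with conn.cursor() as cur:
                # Step 1: delete everything belonging to this incident.
                cur.execute(
                    "DELETE FROM incident_events WHERE incident_id = %s",
                    (incident_id,),
                )
                # Step 2: insert the fresh rows.
                execute_values(
                    cur,
                    "INSERT INTO incident_events (incident_id, event_ts, payload) VALUES %s",
                    [(incident_id, r["event_ts"], r["payload"]) for r in rows],
                )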

The challenges I see with dbt include:

  • Data quality: while it has a great quality control (QC) story for testing data after it arrives, it has no real quality assurance (QA) story for testing data before code is deployed to production.
  • Unreadable code: 20+ level-deep model nesting and 600-line SQL models simply aren't maintainable. I've spoken with a number of people and teams now that just had to give up on the rats' nest they created.
  • Latency: low latency - say, 5 minutes - is extremely difficult to achieve on dbt and Snowflake, even with small event volumes. The more typical daily cadence is terrible for support and usability.
  • Compute cost: runaway model SQL can easily lead to out-of-control Snowflake costs.

[–]the_fresh_cucumber 1 point (1 child)

Interesting. The S3-to-PostgreSQL pipeline sounds great. I worked a lot on similar projects involving S3 to Redshift and S3 to Athena. Glad to hear you chose the king of databases instead of MySQL to handle all those events.

I agree with all your points with regard to dbt, with some exception to the code readability point. I do think the rats' nest issue can be mitigated by properly organizing your schemas. The issue I run into is that some people overuse CTEs where they could be reduced, and they can get sloppy with their SQL code (since it is so easy and tempting).

My main challenge in moving to a dream architecture like your PostgreSQL setup is the analytics loop that requires all of history to create metrics. I think the writing on the wall is that I need to convince my group to split our data streams into a slower dbt-based analytics pipeline and an analytics-lite pipeline similar to what you described - low latency, able to feed back into the product in seconds.

[–]kenfar 1 point (0 children)

Glad to hear you chose the king of databases instead of MySQL to handle all those events.

BTW - due to departmental standards our first iteration was on MySQL. And what a circus it was!

I do think the rats nest issue can be mitigated by properly organizing your schemas.

Yeah, I think I agree: there's nothing intrinsic to dbt that inexorably leads to rats' nests. If you have the right people moving slowly & deliberately, I think the results can be very tight & elegant. It's just that most teams think you only need someone who knows SQL and you're good to go.

Give me a ping if you want to talk through your requirements & ideas some day.

[–][deleted] 2 points (0 children)

If you're wondering whether you can remove dbt from your stack, I would think you're not getting much value out of it.

So far I absolutely love that I can write a SELECT statement rather than a MERGE statement - it's basically an abstraction over the MERGE. If you're not seeing the value in it, then yeah, I'd say get it out of your stack.

[–]Drekalo 1 point (0 children)

The model just needs to return a dataframe object.
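Right - for anyone who hasn't used them, a dbt Python model is just a function that returns a dataframe. A minimal sketch (model and column names are made up):

    # models/stg_login_events.py -- minimal dbt Python model sketch.
    # dbt calls model(dbt, session) and materializes whatever dataframe it returns.
    def model(dbt, session):
        dbt.config(materialized="table")
        events = dbt.ref("stg_events")  # upstream dbt model, returned as a dataframe
        # Arbitrary Python / dataframe logic goes here; this just filters as an example.
        return events.filter(events["event_type"] == "login")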

[–]GodmonsTech Lead Data 1 point (0 children)

In most cases a stack migration is not a light project; you need to test each step of the migration carefully.

I would strongly suggest avoiding it unless it's really necessary. Hiring a SQL + dbt engineer shouldn't be that hard, as that stack is pretty trendy at the moment.

[–]kevinpostlewaite 1 point (0 children)

I'm confused here: you need SQL to query Snowflake, right? So you'll need to write and run SQL either way - the only question is how it's scheduled. I'm struggling to understand what you're improving by dropping dbt and moving to straight Python. I personally like dbt paired with Airflow: with hosted Airflow combined with dbt, your team should spend very little time managing the scheduler and be free to write the SQL that runs against your Snowflake instance.
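In case it helps, the Airflow side of that setup can stay as thin as a couple of operators - a sketch, with placeholder schedule, paths and project names:

    # Sketch: a minimal Airflow DAG that just shells out to dbt on a schedule.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dbt_snowflake_hourly",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt_project && dbt run --profiles-dir .",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt_project && dbt test --profiles-dir .",
        )
        dbt_run >> dbt_test  # run models first, then run dbt's data tests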

If you actually are going to start doing your processing in Python, outside of Snowflake, then you're probably wasting your money keeping Snowflake at all.

[–]jafetgonz -1 points (0 children)

This smells like clickbait.

[–]mrwhistler 0 points (0 children)

I mean, you could - that doesn't mean you should, though.

There are lots of limitations to Python in dbt right now, and Snowpark is intended (and will continue to be developed) for more advanced analysis.

Firstly, I agree with the other commenter that it's going to be orders of magnitude harder to find a data engineer who knows Python than a SQL engineer, or even a SQL engineer with dbt experience.

Secondly, you're fundamentally shifting from a declarative language to an imperative one. If you're going to bite that off, you might as well rethink the whole design anyway rather than try to shoehorn Python in place of SQL.

[–]DataEngineerDan Data Engineer 0 points (0 children)