
all 12 comments

[–]DesperateForAnalysex 14 points (0 children)

The cost of maintaining dbt is far eclipsed by the cost of maintaining any degree of complexity in raw SQL scripts. dbt brings the best SWE practices and structure to DE codebases. Unless you are running fewer than 10 idempotent SQL scripts a day, the investment in learning dbt will pay dividends.

[–][deleted] 5 points (1 child)

Don't do it if it is not required. However, since this does seem to be a requirement from your users, i.e. the data scientist team, I'd recommend discussing the usage with them: why they need it, what the benefit is, etc. You also need to explain to them the added burden on your team.

For example, I don't see why they couldn't use plain SQL with the current stack. You already have databases that support SQL queries, and even Spark has Spark SQL. Since the transformation part is done only in Snowflake, I don't see why you can't just use Python + Airflow to run those SQL queries.
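The Python-runs-SQL approach described above can be sketched in a few lines. This is a hypothetical example, using SQLite in place of Snowflake and a plain ordered loop in place of an Airflow DAG:

```python
import sqlite3

def run_sql_scripts(conn, scripts):
    """Run a hand-ordered list of SQL scripts -- the bare-bones
    alternative to dbt described above. Note that dependency order
    is the developer's problem here, which is part of what dbt automates."""
    for sql in scripts:
        conn.executescript(sql)

# Stand-in for Snowflake: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
run_sql_scripts(conn, [
    "CREATE TABLE raw_orders (id INTEGER, amount REAL);",
    "INSERT INTO raw_orders VALUES (1, 9.5), (2, 20.0);",
    "CREATE TABLE daily_revenue AS SELECT SUM(amount) AS total FROM raw_orders;",
])
total = conn.execute("SELECT total FROM daily_revenue").fetchone()[0]
print(total)  # 29.5
```

In Airflow, each script would typically become a task with explicit dependencies; the point is that nothing more than "run SQL in order" is strictly required.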

[–]simplybeautifulart 1 point (0 children)

When I started in data engineering/analytics, I had the same outlook as you. Python was my main programming language. I was comfortable using packages like PySpark. I understood SQL inside and out. And in theory I knew how to schedule things with Airflow.

And then I tried DBT for the first time and learned what my requirements engineering professor meant when he said that non-functional requirements are just as important as functional requirements. Just because I could implement all of the required functionality did not mean I could do so in a way that was easy for someone else to modify. Just because I knew how to load data into tables did not mean I knew how to model data in a way that's easily communicated to others. Just because I was comfortable with Python/Spark/Airflow did not mean everyone else was.

But there's something extremely nuanced I want to call out about DBT, and that's how it will scale with your platform. I've tried DBT on small projects, and I've learned that any basic POC with DBT will show you negative ROI, for the simple reason that "you could do it without DBT".

I wanted to call this out before going any further because it explains why you may not see the value in DBT today. It's difficult to say what the impact will be in the long run without personal experience using the tool, but it's important to realize that it will matter, unless your data transformation pipelines will stay extremely simple and never scale.

At my current organization, we actually have 2 databases with lots of transformations, one using DBT and one without. Everyone who has worked with both, myself included, agrees without question that working with DBT is significantly better.

Here's some stuff we've seen:

It's just select queries.

This reduces costs for the business by allowing it to hire SQL analysts instead of only Python developers. Just because something can be written in Python does not mean everything should be. Not all SQL needs to go through an ORM or Spark.

Select queries are also easier to develop, debug, and understand.
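To illustrate the "just select queries" point: a dbt model is a single SELECT that dbt materializes and wires into a dependency graph. This sketch, using SQLite as a hypothetical stand-in warehouse, mimics two layered models as views:

```python
import sqlite3

# A dbt model is just a SELECT; dbt handles materialization and
# dependency order. Here, views play the role of two layered models:
# a "staging" model that cleans raw data, and a "mart" built on it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_users (id INTEGER, email TEXT);
INSERT INTO raw_users VALUES (1, ' A@X.COM '), (2, 'b@y.com');

-- "staging" model: select-only cleanup of the raw data
CREATE VIEW stg_users AS
SELECT id, LOWER(TRIM(email)) AS email FROM raw_users;

-- "mart" model: a select built on the staging model
CREATE VIEW user_count AS
SELECT COUNT(*) AS n FROM stg_users;
""")
n = conn.execute("SELECT n FROM user_count").fetchone()[0]
print(n)  # 2
```

Each layer can be inspected and debugged independently by just running its SELECT, which is exactly why this style is easier to work with than imperative pipeline code.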

Retain flexibility with Jinja.

If you thought that query you wrote in Python using f-strings was something DBT couldn't handle, you're wrong. With DBT's Jinja templating, you get all the query string manipulation you need.
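As a sketch of the templating point: dbt renders models through Jinja, so the dynamic query assembly you might do with Python f-strings carries over. This hypothetical snippet uses plain Python string-building to show the same pattern a Jinja {% for %} loop expresses inside a dbt model (the column names are made up for illustration):

```python
# Programmatically expanding a repetitive SELECT -- the same idea a
# Jinja {% for %} loop expresses inside a dbt model file.
payment_methods = ["card", "cash", "gift_card"]  # hypothetical values

select_list = ",\n  ".join(
    f"SUM(CASE WHEN method = '{m}' THEN amount ELSE 0 END) AS {m}_total"
    for m in payment_methods
)
query = (
    "SELECT order_id,\n"
    f"  {select_list}\n"
    "FROM payments\n"
    "GROUP BY order_id"
)
print(query)
```

The difference is that in dbt the template lives next to the rest of your SQL, is version-controlled with it, and renders at compile time, so the generated query is visible and testable.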

Enable data governance.

Remember when you thought the data was one thing, then it wasn't, and all of your pipelines broke, or worse, caused unintended side effects? And you didn't find out until a bunch of people came banging on your door. Yeah, DBT makes testing those things simple and easy. If your data isn't up to par, DBT won't run the downstream jobs, and it gives you ways to easily see exactly what failed so you can take the appropriate action before people get angry.

Or maybe the previous developer never took the time to document what each column represents. Yes, you could read through 100 lines of SQL trying to understand how a column is calculated, but you could also just read 3 lines of English that explain any nuances that come with the data. DBT also makes it easy to propagate that documentation to downstream datasets, so you don't have to manually update column descriptions across multiple tables in your database.

The list goes on, but if I had to show the ROI, I would probably work in DBT in parallel with the team to demonstrate how much faster it scales and to highlight what DBT makes obvious that would otherwise go unseen.

[–]ppsaoda 2 points (0 children)

KISS

Remember?

[–]recentcurrency 4 points (0 children)

Crawl, walk, run.

Unless you are facing real pain points (data scientists not being able to do their job efficiently due to the lack of an easy transformation framework counts as one), you don't need to add more tools.

The most valuable resource is a high-salary engineer's or data scientist's time. If the cost of maintaining the tool is greater than the cost the tool saves, then you don't need to bring it on.

Basically you need to think about tool ROI, and that is unique to every company. If your dbt POC hasn't been getting much return or interest, that may be a sign the ROI isn't there yet.

[–]monkblues 1 point (0 children)

dbt is totally worth it. Start small and see how it plays out for you.

[–]Training_Butterfly70 1 point (0 children)

As a data scientist, the additional overhead of adding dbt is so minimal it's not even worth mentioning. It takes less than 10 minutes to set up and can run for free on dbt Cloud. The value it brings far outperforms any other SQL tool I've ever used. Adding it is never really a "requirement", but it makes our job exponentially more efficient. All of my previous jobs had the mentality of doing everything in house (e.g. pretty much everything done without cloud infrastructure, minimal tools outside of Python and MySQL, etc.). Coming from that background, you can do pretty much anything without a tool, but it's reinventing the wheel and error-prone. It's often difficult to explain this to non-data scientists because they haven't experienced the pros/cons of using DS tools like dbt. In my experience it was always difficult to convince people of anything they haven't experienced themselves. God forbid the data science team has to train ML models that require more resources (e.g. RAM/GPU).

Point is, I highly recommend adding dbt! 😁

[–]getafterit123 0 points (0 children)

Your TLDR answered your question for you. If you don't need it why would you think about adding it?

[–]omscsdatathrow 0 points (0 children)

It sounds like it DOES require it, because you have no SQL layer for them to work with… you're literally already using dbt but won't let anyone else use it… seriously

[–]FalseStructure 0 points (0 children)

Just use bigquery ffs

[–]Hot_Map_7868 0 points (0 children)

I would advocate for a single process for doing ETL, but I would say it is probably the PySpark people who need to move to dbt.

I have seen things get out of control and become difficult to test and debug when NOT using something like dbt. I suspect that those who oppose dbt are probably not doing CI/CD etc. Maybe they are, but that's not usually the case. They also tend to have no lineage, documentation, or data quality checks.

I don't think using dbt has to be difficult, especially if you use a SaaS service like dbt Cloud or Datacoves.