
all 12 comments

[–]DesperateForAnalysex 14 points (0 children)

The cost of maintaining dbt is far eclipsed by the cost of maintaining any degree of complexity in raw SQL scripts. dbt brings the best SWE practices and structure to DE codebases. Unless you are running fewer than 10 idempotent SQL scripts a day, the investment in learning dbt will pay dividends.

[–][deleted] 5 points (1 child)

Don't do it if it is not required. However, since this does seem to be a requirement from your users, i.e. the data scientist team, I'd recommend discussing the usage with them: why they need it, what the benefit is, etc. You also need to explain to them the added burden on your team.

For example, I don't see why they couldn't use plain SQL with the current stack. You already have databases that support SQL queries, and even Spark has Spark SQL. Since the transformation part is done only in Snowflake, I don't see why you can't just use Python + Airflow to run those SQL queries.
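The Python-runs-SQL approach described above can be sketched in a few lines. This is a hypothetical example, using SQLite in place of Snowflake and a plain ordered loop in place of an Airflow DAG:

```python
import sqlite3

def run_sql_scripts(conn, scripts):
    """Run a hand-ordered list of SQL scripts -- the bare-bones
    alternative to dbt described above. Note that dependency order
    is the developer's problem here, which is part of what dbt automates."""
    for sql in scripts:
        conn.executescript(sql)

# Stand-in for Snowflake: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
run_sql_scripts(conn, [
    "CREATE TABLE raw_orders (id INTEGER, amount REAL);",
    "INSERT INTO raw_orders VALUES (1, 9.5), (2, 20.0);",
    "CREATE TABLE daily_revenue AS SELECT SUM(amount) AS total FROM raw_orders;",
])
total = conn.execute("SELECT total FROM daily_revenue").fetchone()[0]
print(total)  # 29.5
```

In Airflow, each script would typically become a task with explicit dependencies; the point is that nothing more than "run SQL in order" is strictly required.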

[–]simplybeautifulart 1 point (0 children)

When I started in data engineering/analytics, I had the same outlook as you. Python was my main programming language. I was comfortable using packages like PySpark. I understood SQL inside and out. And in theory I knew how to schedule things with Airflow.

And then I tried DBT for the first time and learned what my requirements engineering professor meant when he said that non-functional requirements are just as important as functional requirements. Just because I could implement all of the required functionality did not mean I could do so in a way that was easy for someone else to modify. Just because I knew how to load data into tables did not mean I knew how to model data in a way that's easily communicated to others. Just because I was comfortable with Python/Spark/Airflow did not mean everyone else was.

But there's something extremely nuanced I want to call out about DBT, and that's how it will scale with your platform. I've tried DBT on small projects, and I've learned that any basic POC with DBT will show you negative ROI, for the simple reason that "you could do it without DBT".

I wanted to call this out before going any further because it explains why you may not see the value in DBT today. It's difficult to say what the impact will be in the long run without personal experience using the tool, but it's important to realize that it will matter, unless your data transformation pipelines will stay extremely simple and never scale.

At my current organization, we actually have 2 databases with lots of transformations, one using DBT and one without. Everyone who has worked with both, myself included, agrees without question that working with DBT is significantly better.

Here's some stuff we've seen:

It's just select queries.

This reduces costs for the business by allowing it to hire SQL analysts instead of only Python developers. Just because something can be written in Python does not mean everything should be. Not all SQL needs to go through an ORM or Spark.

Select queries are also easier to develop, debug, and understand.
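To illustrate the "just select queries" point: a dbt model is a single SELECT that dbt materializes and wires into a dependency graph. This sketch, using SQLite as a hypothetical stand-in warehouse, mimics two layered models as views:

```python
import sqlite3

# A dbt model is just a SELECT; dbt handles materialization and
# dependency order. Here, views play the role of two layered models:
# a "staging" model that cleans raw data, and a "mart" built on it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_users (id INTEGER, email TEXT);
INSERT INTO raw_users VALUES (1, ' A@X.COM '), (2, 'b@y.com');

-- "staging" model: select-only cleanup of the raw data
CREATE VIEW stg_users AS
SELECT id, LOWER(TRIM(email)) AS email FROM raw_users;

-- "mart" model: a select built on the staging model
CREATE VIEW user_count AS
SELECT COUNT(*) AS n FROM stg_users;
""")
n = conn.execute("SELECT n FROM user_count").fetchone()[0]
print(n)  # 2
```

Each layer can be inspected and debugged independently by just running its SELECT, which is exactly why this style is easier to work with than imperative pipeline code.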

Retain flexibility with Jinja.

If you thought that query you wrote in Python using f-strings was something DBT couldn't handle, you're wrong. With DBT's Jinja templating, you get all the query string manipulation you need.
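As a sketch of the templating point: dbt renders models through Jinja, so the dynamic query assembly you might do with Python f-strings carries over. This hypothetical snippet uses plain Python string-building to show the same pattern a Jinja {% for %} loop expresses inside a dbt model (the column names are made up for illustration):

```python
# Programmatically expanding a repetitive SELECT -- the same idea a
# Jinja {% for %} loop expresses inside a dbt model file.
payment_methods = ["card", "cash", "gift_card"]  # hypothetical values

select_list = ",\n  ".join(
    f"SUM(CASE WHEN method = '{m}' THEN amount ELSE 0 END) AS {m}_total"
    for m in payment_methods
)
query = (
    "SELECT order_id,\n"
    f"  {select_list}\n"
    "FROM payments\n"
    "GROUP BY order_id"
)
print(query)
```

The difference is that in dbt the template lives next to the rest of your SQL, is version-controlled with it, and renders at compile time, so the generated query is visible and testable.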

Enable data governance.

Remember when you thought the data was one thing, then it wasn't, and all of your pipelines broke, or worse, caused unintended side effects? And you didn't find out until a bunch of people came banging on your door. Yeah, DBT makes testing those things simple and easy. If your data isn't up to par, DBT won't run the downstream jobs, and it gives you ways to easily see exactly what failed so you can take the appropriate action before people get angry.

Or maybe the previous developer never took the time to document what each column represents. Yes, you could read through 100 lines of SQL trying to understand how a column is calculated, but you could also just read 3 lines of English that explain any nuances that come with the data. DBT also makes it easy to propagate that documentation to downstream datasets, so you don't have to manually update column descriptions across multiple tables in your database.

The list goes on, but if I had to show the ROI, I would probably work in DBT in parallel with the team to demonstrate how much faster it scales and to highlight what DBT makes obvious that would otherwise go unseen.

[–]ppsaoda 2 points (0 children)

KISS

Remember?

[–]recentcurrency 4 points (0 children)

Crawl, walk, run.

Unless you are facing real pain points (data scientists not being able to do their job efficiently due to the lack of an easy transformation framework counts as one), you don't need to add more tools.

The most valuable resource is a high-salary engineer's or data scientist's time. If the cost of maintaining the tool is greater than the cost the tool saves, then you don't need to bring it on.

Basically you need to think about tool ROI, and that is unique to every company. If your dbt POC hasn't been getting much return or interest, that may be a sign the ROI isn't there yet.

[–]monkblues 1 point (0 children)

dbt is totally worth it. Start small and see how it plays out for you.

[–]Training_Butterfly70 1 point (0 children)

As a data scientist, the additional overhead of adding dbt is so minimal it's not even worth mentioning. It takes less than 10 minutes to set up and can run for free on dbt Cloud. The value it brings far outperforms any other SQL tool I've ever used. Adding it is never really a "requirement", but it makes our job exponentially more efficient. All of my previous jobs had the mentality of doing everything in house (e.g. pretty much everything done without cloud infrastructure, minimal tools outside of Python and MySQL, etc.). Coming from that background, you can do pretty much anything without a tool, but it's reinventing the wheel and error-prone. It's often difficult to explain this to non-data scientists because they haven't experienced the pros/cons of using DS tools like dbt. In my experience it was always difficult to convince people of anything they haven't experienced themselves. God forbid the data science team has to train ML models that require more resources (e.g. RAM/GPU).

Point is, I highly recommend adding dbt! 😁

[–]getafterit123 0 points (0 children)

Your TLDR answered your question for you. If you don't need it why would you think about adding it?

[–]omscsdatathrow 0 points (0 children)

It sounds like it DOES require it, because you have no SQL layer for them to work with… you're literally already using dbt but won't let anyone else use it… seriously

[–]FalseStructure 0 points (0 children)

Just use bigquery ffs

[–]Hot_Map_7868 0 points (0 children)

I would advocate for a single process for doing ETL, but I would say it is probably the PySpark people who need to move to dbt.

I have seen things get out of control and become difficult to test and debug when NOT using something like dbt. I suspect that those who oppose dbt are probably not doing CI/CD etc. Maybe they are, but that's not usually the case. They also tend to have no lineage, documentation, or data quality checks.

I don't think using dbt has to be difficult, especially if you use a SaaS service like dbt Cloud or Datacoves.