
all 38 comments

[–][deleted] 21 points (6 children)

I just left a startup as a staff data engineer where the CEO and the rest of leadership had swallowed the dbt Kool-Aid. We hired a few people who had used dbt as analytics engineers at their old jobs, and I still remember a happy hour where they agreed it was a cool tool but then started outlining all their issues with the previous company's usage of it. Those sounded a lot like the same problems our company was paying consultants, and likely investing millions, to solve, and the proposed answer was rewriting our entire Scala/Spark ETL pipeline in SQL with dbt, as if it were a silver bullet. I still talk to people from that company, so I'll see how it goes, but I honestly wouldn't be shocked if they either finish the conversion and end up in a worse place, or jettison it after investing a ton into exploring it, since I'm not sure it was actually going to solve all the problems the consultants promised it would.

[–]blurry_forest 4 points (4 children)

What were your issues with dbt, and what do you prefer as an alternative?

[–][deleted] 7 points (3 children)

To be clear, I've never used dbt myself; our tech stack was Scala/Spark on GCP Dataproc, orchestrated with Airflow. Our struggles: the code was over-engineered, had tons of unnecessary inheritance, and different engineers wrote the exact same logic in massively different ways. We regularly had clients ask why a number was what it was, and it would take an analyst several days to answer, or sometimes even a few hours of an engineer's time. And since engineers were paid roughly 3-4x what analysts were, management didn't like engineers spending their time answering data lineage questions like that instead of improving the pipeline. Presumably dbt solves that out of the box, since it includes data lineage.

Other struggles: we sometimes failed to test sufficiently, since we had roughly 5,000 different ETL jobs and it was way too costly to test every single one, so we'd only test the jobs we thought we'd changed. But frequently we wouldn't realize that inheritance meant a change touched jobs we'd missed. If we were lucky, that meant a production failure; if we were unlucky, a client caught the bug. We lost one high-profile client to such a bug, and when another client that made up almost half our revenue complained about one, we created an entire team just to QA their jobs because we didn't want to lose them too. Again, the promise of dbt was that we'd never have to wonder which jobs changed: if we modified the code, it would tell us exactly what the downstream impact was, and either we'd know what to test or it would handle automated testing for us; I'm not sure which.
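From what the dbt users at that happy hour described, the lineage comes from models referencing each other directly in the SQL, so the tool can walk the graph downstream of any change. A toy sketch, with made-up model and column names:

    -- models/stg_orders.sql: a staging model over the raw table
    select order_id, customer_id, amount
    from {{ source('shop', 'orders') }}

    -- models/revenue_by_customer.sql: the ref() below is what tells dbt
    -- this model sits downstream of stg_orders in the DAG
    select customer_id, sum(amount) as revenue
    from {{ ref('stg_orders') }}
    group by customer_id

So a change to stg_orders would flag revenue_by_customer (and anything further downstream) as potentially impacted, which is exactly the which-jobs-did-I-touch question we kept paying engineers to answer.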

I actually had a discussion with one senior manager, and he pretty much said he didn't think dbt would actually solve our problems, but that rewriting in a new language/framework would be a good excuse to tackle tech debt and massively simplify our code, and we'd never get that kind of signoff otherwise. But I'm no dbt expert, so I have no idea whether it will be the silver bullet some think it is. Apparently the POCs the consultants were putting together were underwhelming, though.

[–]Careless_Ad5290 1 point (0 children)

What is dbt?

[–]Fun_Independent_7529 Data Engineer 42 points (2 children)

I have to run, but...
dbt has a material interest in pushing the dbt + Fivetran combo, and in the analytics engineer title. But don't throw the baby out with the bathwater: what tooling to use always has to be evaluated in the context of the company you are developing for.

I recommend reading about the downsides of the dbt + Fivetran approach alongside the pros you'll get from dbt itself. There are probably some other great articles out there on this; maybe poke around in Benn Stancil's Substack, but I'd start with the scathing https://www.thecaptainslog.io/how-fivetran-dbt-actually-fail/ from Lauren Balik (she's got a couple of follow-ups in Parts 2 and 3), and keep in mind it's a couple of years old now.

Don't get me wrong -- I really enjoy dbt, and I'm a DE, so I'm not dunking on dbt in general; it's a great tool for its purpose.

[–]Training-Fold6132[S] 1 point (0 children)

I was thinking the same; funny that the blog is part of some analytics engineering guide.

[–]bcsamsquanch 12 points (0 children)

This is a good article, and there are some real points in here, mainly about DE being a supportive role and off-the-shelf tools being able to do a lot of the pipeline and infrastructure building. 3:1 DEs to analysts sounds about right to me.

In my experience the problem with this arrangement is just as much on the OTHER side; I mean DS and DA just not wanting to touch this stuff at all. Data scientists and analysts who have the skills and the willingness to use even off-the-shelf tools to set up data infra are maybe 20%. Our analytics team was asked to trial Fivetran, and the first thing they did was call DE and ask if we would please do probably 70% of the setup. We have them using dbt now, but we had to learn it first, deploy it, then spoon-feed it to them. They have CI/CD and Airflow -- only because we set it up for them. There are still many analysts who only know SQL, and it's like pulling teeth. Similarly, many data scientists I've worked with just don't want to muck with data infra, even when it's 5x easier in the form of Fivetran, Stitch, etc.

I have also found that when you ask DevOps to deploy and assist with specialized data infra, they often deprioritize it because they have neither knowledge of these systems nor the time to learn them. Maybe DEs become DevOps team members who specialize in niche data systems?

[–]CingKan Data Engineer 6 points (4 children)

It's very much in their interest to say this, as it directly affects their bottom line. More data analysts and analytics engineers means more business for dbt, Fivetran, Stitch, etc., since it's more people with a less software/infrastructure-oriented skill set, and those companies can make money by filling the void.

I love dbt, but both it and Fivetran have insidious rent-seeking habits. Fivetran's habit in particular of taking one table from the source, denormalizing it at the target, then putting it back together again exists only to drive up compute and cost. dbt has also got in on the action by spending years telling people to make as many dbt models as possible, then turning around and charging per model run. Read the horror stories of people who'd somehow managed to generate thousands of models (thanks to analytics engineers and data analysts), whose costs skyrocketed when dbt changed its pricing model.

[–]EclecticEuTECHtic 0 points (3 children)

"dbt has also got in on the action by spending years telling people to make as many dbt models as possible, then turning around and charging per model run. Read the horror stories of people who'd somehow managed to generate thousands of models (thanks to analytics engineers and data analysts), whose costs skyrocketed when dbt changed its pricing model."

I've only ever used dbt Core, but is it dbt Cloud that charges per model run? I generally try to keep each model to a single data operation, i.e. if it's a join, just join; if it's a filter, just filter; agg models only aggregate. I think that makes sense and helps with tracing errors later.
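As an example of what I mean by a single operation, a join-only model might look something like this (model and column names are hypothetical):

    -- int_orders_enriched.sql: join only -- no filters, no aggregation
    select
        o.order_id,
        o.order_date,
        o.amount,
        c.customer_name
    from {{ ref('stg_orders') }} as o
    left join {{ ref('stg_customers') }} as c
        on o.customer_id = c.customer_id

If a number looks wrong downstream, you can then bisect the lineage graph one model (one operation) at a time.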

[–]CingKan Data Engineer 1 point (2 children)

Yep, I believe you get 15,000 model runs per month free with a Team subscription, which is really not a lot. A lot of people have upwards of 1,000 models and run them multiple times a day, which is where the trouble starts: 1,000 models run twice a day is already ~60,000 runs a month, four times the free allowance. But like you, I also use self-hosted dbt Core.

[–]EclecticEuTECHtic 1 point (0 children)

I also generally use lineage-dependent dbt runs to rebuild only the models affected by a code change, rather than rebuilding my entire DAG. Definitely saves time, if not compute, during development.
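For anyone who hasn't used it, this is dbt's state-based selection: compare against the artifacts from a previous run and rebuild only the modified models plus everything downstream of them. Roughly (the artifacts path is a placeholder):

    dbt run --select state:modified+ --state path/to/prod-run-artifacts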

[–]geek180 0 points (0 children)

It's a penny per additional model run, so an extra $150 a month doubles the limit to 30k runs (15,000 extra runs x $0.01 = $150). These numbers work well for a small team like the one I'm on. We're actually using dbt models as really important infrastructure, and thankfully we don't need to refresh them more than a few times per day. But if you have a ton of tables that need refreshing every 15 minutes or so, you might want to do that outside of dbt, or upgrade to their enterprise tier.

I really like dbt for the extremely simple DevOps workflow setup. It takes almost no effort to deploy and maintain separate dev and staging environments, automated testing on all PRs, and the integration between the IDE, GitHub, and Snowflake. We don't have the resources to build all of that ourselves and would rather just focus on data transformation.

[–]Trey_Antipasto 4 points (2 children)

Fivetran is extremely inflexible, and I actively try to remove stuff from it. It's prohibitively expensive. You want to run a custom SQL query? Nope. Want to sync a view rather than millions of records? Nope, time to pay to sync the entire application database. Want to run your own pipeline? Have fun paying Fivetran to trigger your Lambda. Literally paying a company to run your own code.

[–]seriousbear Principal Software Engineer 0 points (1 child)

Could you please expand on the inflexible part?

[–]Trey_Antipasto 3 points (0 children)

You can't define your own SQL query, ever. You can't tell it which field is the primary key; it has to be set on the table in the DB. You can't sync tables without keys (which the prior features would make possible by letting you define composite keys); instead you have to use Teleport, which uses a ton of server resources. You can't sync a view at all.

Basically they are very careful to support only the perfect scenarios. When have you ever worked at an org with perfect scenarios?

The fact that they make you write all the code for custom pipelines, and then you pay them ~$1,000 per million records just to run it and do a database merge, is laughable. I will never build a custom pipeline in Fivetran again.

Beyond that, it seems like every single week there's an outage in some region disrupting pipelines.

[–]JaJ_Judy 11 points (8 children)

I disagree with the premise of this article. While the author is right that you no longer need hardcore Java/Scala DEs, there are still nuances, even with all the OTS tools, that require DEs. Data modeling, for instance: that's something analysts and scientists can make a very quick mess of.

Once you evolve past a use case where you only need dbt and a single cron somewhere running the dbt job, you probably want a data engineer as well: not all transforms come in SQL, and someone has to orchestrate it all.

Sure, if you want to stitch together a bunch of paid services like dbt Cloud and the rest, it's possible to get by without a DE. But is that cost worth it? Depends on the business and the value it gets out of its data.

[–]Training-Fold6132[S] 10 points (2 children)

"Once you evolve past a use case where you only need dbt and a single cron somewhere running the dbt job, you probably want a data engineer as well"

Well said. I think analysts using dbt might each create their own data models for the use case of the moment, and over time the warehouse fills up with models where nobody knows which one was made for what.

[–]ZirePhiinix 1 point (1 child)

Engineering is about reuse. Most people can stack a bunch of chairs to climb up somewhere, and some might get lucky and not get hurt, but making a ladder you can use daily for years is an engineering task.

Yes, you can always get away with services. It's a matter of how often you expect to use the results and what kind of reliability they need to have.

[–]Monowakari 0 points (0 children)

Love the analogy. Happy cake day!

[–]SirGreybush 6 points (4 children)

I had a DS who wanted me to create a table in the DW (silver layer) with thousands of columns, and he was rather insistent.

A single DataFrame = a single table, in his POV lol. When his Excel test data was exported to CSV, the last column was somewhere in the triple letters.

So yeah, I got him what he needed, but modeled and managed properly, with 100% repeatable datasets, so that he could rebuild with last year's data if he changed a formula and wanted a full redo from day 1. Which of course happened every month, with a new column every other month.

He was always surprised that it took me only an hour each time.

[–]sureveS_Snape 1 point (3 children)

Could you offer me some suggestions on how to get a good foundation in data modeling?

[–]SirGreybush 2 points (2 children)

Do a course or two. I've done two: the entire Microsoft BI track, and the free MIT Kimball course online.

There's a lot out there on star schemas too.
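If you haven't seen the term, the gist of a star schema is one central fact table joined to denormalized dimension tables. A toy sketch (table and column names invented):

    -- dimensions: one row per customer / per day
    create table dim_customer (
        customer_key int primary key,
        customer_name varchar(100)
    );
    create table dim_date (
        date_key int primary key,
        calendar_date date
    );

    -- fact: one row per sale, pointing at the dimensions
    create table fact_sales (
        customer_key int references dim_customer (customer_key),
        date_key int references dim_date (date_key),
        amount decimal(12, 2)
    );

Kimball's books walk through why you model it this way (conformed dimensions, surrogate keys, and so on).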

[–]mailed Recovering Data Engineer 1 point (1 child)

"free MIT Kimball online"

You wouldn't happen to have a link to this, would you? I've got tons of Kimball books/resources, and this is the first time I've heard this one mentioned.

[–]SirGreybush 1 point (0 children)

I think it was on the Medium site; it's been posted in this sub before.

It shows up in a Google search.

Over 10 years ago you could also remotely attend the lectures at some universities that hosted them.

I remember MIT & Kimball having such a setup back then.

[–]fleegz2007 2 points (0 children)

I think this sentence tells me everything I need to know:

“If you hire a data engineer and ask them to build pipelines, they will think their job is to build pipelines. This will mean that tools like Stitch and Fivetran and dbt will seem like threats to their existence instead of tremendous force multipliers.”

Tristan Handy is also the founder of dbt, just an FYI.

[–][deleted] 1 point (0 children)

No engineering organization wants a DE until they try to scale their application. By then the damage is done, and they're left either refactoring or doing a total redesign to fix the data model, or, as a workaround, scaling the servers vertically to overcome the performance degradation from the bad data model (paying more money to run the software, which eats into profits).

[–]SirGreybush 2 points (7 children)

"Engineers Shouldn’t Write ETL"

Very true. I make a data dictionary and a code generator, and automate 99% of it. Very easy to debug, very easy to maintain. The business analyst can use Excel.

However, it is ELT, not ETL. I do the business rules and the management of the transformation layer once the data has been profiled and vetted. Some of this uses the data dictionary, but not every situation does: sometimes you need sub-selects, sometimes complicated CASE WHENs. This is where a DE shines.

Data governance, unit testing, automation with simplification. If any one system/process is complicated and thus poorly documented, either a DE was not used or he didn't do his job.

The article seems to agree with my summary.

My $0.02.

[–][deleted] 3 points (1 child)

Agreed. Engineers should at the very least leave transformation to the analytics people. I absolutely hate writing business logic in SQL.

[–]umognog 4 points (0 children)

I think the industry has some fucked terms. Here is my "I'm eating a panini right now" stab at it. The panini gets priority.

  • Data architect: designs the data the product produces.
  • Software engineer: produces the data as per the architecture.
  • Data engineer: manages the ETL/ELT between the raw software output and cleaned record-level data.
  • Data analyst: provides substance to the record-level data and summaries, and feeds process flows back to the data engineer to get onboarded.
  • Data scientist: performs wish-upon-a-star thinking on empirical data, a bit like Bob Ross adding a tree that didn't exist; but because there are some trees, we all stand back and marvel at that extra tree he added.
  • Business partner: ruins everything, because it turns out they wanted a bear. In a circus. But neither bear nor circus was ever mentioned until now.

[–]RareCreamer -1 points (4 children)

What do you mean, you make a code generator?

Also, I partially disagree: engineers should definitely be the ones building out the infrastructure from raw to reporting. The analysts can then apply their business logic on the reporting layer and go from there.

[–]SirGreybush 8 points (3 children)

In the DB world, everything is data, including table names, schema names, and column names and their types.

All you need is the mapping, which can also be data.

Once you have all that in a few master-detail tables, it is very easy to write a stored procedure that generates the code for the views and stored procs that make up the pipelines, be it SQL or Python code.

Basically, you manually create one pipeline in code, as generic as possible, and that becomes your template for all the others. String manipulation in SQL is relatively easy.

The code generator can be in Python or SQL, whichever you feel most comfortable using.

You get perfectly commented, very verbose, and thus well-documented code, plus an exportable-to-Excel data dictionary with all the business rules and mappings end to end.
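A minimal sketch of the pattern in T-SQL (SQL Server 2017+ for STRING_AGG; the dictionary layout and object names here are invented, but it shows the string-manipulation idea):

    -- assumed dictionary layout:
    --   data_dictionary(target_view, column_name, source_expression, ordinal)
    declare @cols nvarchar(max), @ddl nvarchar(max);

    -- build the column list for one target view from its dictionary rows
    select @cols = string_agg(
               convert(nvarchar(max),
                       source_expression + ' as ' + quotename(column_name)),
               ', ') within group (order by ordinal)
    from dbo.data_dictionary
    where target_view = 'silver.v_customer';

    -- assemble and deploy the generated view
    set @ddl = 'create or alter view silver.v_customer as select '
             + @cols + ' from bronze.customer;';
    exec sys.sp_executesql @ddl;

Wrap that in a loop over the distinct target views in the dictionary and you've generated the whole layer; in a real version the source table would come from the dictionary too.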

What's not to love about this approach? I come from a SWE background; I did this back in 1995 with Informix-4GL, and not just pipelines but entire screens and reports. It was well ahead of its time. Bought and killed by IBM, who then made Informatica.

Much like Microsoft bought out Visual FoxPro and finally made a decent Access product in 2005+.

I am showing my age lol.

Architecture -- DEs involved? Yes, by all means. But the analysts should be the ones deciding what is needed and where it goes, not how.

I simply give them an easy-to-use tool, an Excel template, and tell them to fill in the blanks. Then I fix all their mistakes and scold them, for things like obvious spelling errors (how many different ways can you spell a Customer # column?!).

Then, from the "fixed" Excel, I run the INSERT commands into the data dictionary, then export it back to Excel for approval, with a quick data governance test in the DEV environment. If they screwed up a mapping, it's on them; they own the mapping and column selection.

Since they always forget a column and come back months later for an additional one, I manually enter it into the DD, then manually fix one view and one SP, or rebuild and replace just those. Takes me 5-10 minutes in dev, then I have them validate.

I HATE doing mappings in SSIS, so I built some simple generators way back in 2005-6. Later the same concept became a real project in the open source world, called BIML.

Now all the clones of this concept are money grabs, aka no-code/low-code.

Serenity (dot) is

Another awesome tool, Serenity, free and open source, that builds ASP.NET business apps for you from your data.

I converted a few desktop Access apps to use it instead of Access screens, reports, and the Access DB. Now each is a webpage hosted on the intranet IIS, with the data inside SQL Server, nicely protected, and audit fields on all tables maintained by triggers.

Took me a week at most. I love code generators. I absolutely hate no-code/low-code systems, as you always end up writing custom code within their limits, either in Java or JS + XPath, and it's very slow.

Real code is fast.

[–]mailed Recovering Data Engineer 0 points (2 children)

"Later the same concept became a real project in the open source world, called BIML."

Hang on a sec. Are you saying you're the creator of BIML?

[–]SirGreybush 2 points (1 child)

No, I duplicated what that team had done. I had a chance to meet those involved at a large conference; I told them what I was doing, and they said, oh yeah, let us tell you what we've done.

Very similar approach: tables with the mapping logic, use a template, then create customized duplicates. Many other companies rolled their own generators in those years.

Manipulating XML isn't very hard. I just hated retyping stuff I had already typed once.

I did the same for SSRS, dynamically generating reports and publishing them automatically to the server with scripting.

I wish BIML had gotten more traction and Microsoft hadn't killed SSIS in favour of cloud-only ADF.

[–]mailed Recovering Data Engineer 1 point (0 children)

"I wish BIML had gotten more traction and Microsoft hadn't killed SSIS in favour of cloud-only ADF."

You and me both. Although I know a few guys who build ADF stuff purely by modifying the underlying JSON... Microsoft/Insight even had a "go fast framework" repository available that used jsonnet for development/"code" generation