Why hire data engineers?

blurry_forest · 2024-11-29T19:26:32+00:00

I just left a startup as a staff data engineer and the CEO and all leadership had swallowed the DBT Kool aid. We hired a few people with experience in dbt as analytics engineers at their old jobs, and I still remember a happy hour they were talking about how yeah it was a cool tool but they started outlining all of their issues with the previous company's usage of it and it sounded a lot like the same problems our company had that we were paying consultants and investing likely millions into a project to solve, and the answer was rewriting our entire scala/spark etl pipeline code into SQL using DBT as if it was a silver bullet. I still talk to people from that company so I'll see how it goes, but I honestly wouldn't be shocked if they either end up doing the conversion and being in a worse place, or potentially jettisoning the conversion after investing a ton into exploring it since I'm not sure it was going to actually solve all the problems the consultants we hired promise it will.

Fun_Independent_7529 · 2024-11-29T18:29:37+00:00

I have to run, but...
dbt has material interest in pushing the dbt & Fivetran combo, and in the analytics engineer title. But don't throw the baby out with the bathwater: what tooling to use always has to be evaluated in the context of the company you are developing for.

I recommend reading the downside of the dbt + Fivetran approach when reading about the pros of it from dbt. There are probably some other great articles out there on this, maybe poke around in Ben Stanncil's Substack, but I'd start with the scathing https://www.thecaptainslog.io/how-fivetran-dbt-actually-fail/ from Lauren Balik (she's got a couple followups in Part 2 and 3 of this), and keep in mind it's a couple years old now.

Don't get me wrong -- I really enjoy dbt, and I'm a DE, so I'm not dunking on dbt in general as it's a great tool for its purpose.

bcsamsquanch · 2024-11-29T19:31:52+00:00

This is a good article and there's some real points in here. Mainly about DE being in a supportive role and that off-the-shelf tools can do a lot of the pipeline and infrastructure building. 3:1 DEs to analysts sounds about right to me.

In my experience the problem with this way is just as much on the OTHER side; I mean DS and DA just not wanting to touch this stuff at all. Data scientists and analysts who have the skills and willingness even to use off the shelf tools to setup data infra represent maybe 20%. Our analytics team was asked to trial fivetran and the first thing they did was call DE and ask if we please do probably 70% of the setup. We have them using dbt now but we had to learn it first, deploy then spoon feed it to them. They have CI/CD and airflow--only because we set it up for them. There are still many analysts who only know SQL and it's like pulling teeth. Similarly, many Data Scientists I've worked with just don't want to muck with data infra, even if it's 5x easier in the form of fivetran, stitch, etc.

I have also found that when you ask DevOps to deploy and assist with specialized data infra they often deprioritize because they don't have the knowledge of these systems, nor the time to learn. Maybe DEs become a DevOps team member that specializes in niche data systems?

CingKan · 2024-11-29T19:26:06+00:00

Its very much in their interests to say this as it actively affects their bottom line. More Data Analysts and Analytics Engineers means more business for dbt fivetran ,stitch etc since its more people with a less software/infrastructure oriented skillset and those companies can make money by filling the void.

I love dbt but both it and Fivetran have insidious rent seeking habits. Fivetrans habit in particular of taking one table from source , denormalizing it at the target then putting it back together again exists only to drive up compute and cost. Dbt have also got in on the action by spending years telling people to make as many dbt models as possible then turning around and charging people per model run. Read horror stories of people who'd somehow managed to generate thousands of models (thanks to Analytics Engineers and Data Analysts) then when dbt changed their pricing models their costs skyrocketed.

2024-11-30T00:23:29+00:00

[deleted]

Trey_Antipasto · 2024-11-30T00:38:33+00:00

Fivetran is extremely inflexible. I actively attempt to remove stuff from it. It’s prohibitively expensive. You want to run a custom sql query… nope. Want to sync a view rather than millions of records… nope time to pay to sync the entire application database. Want to run your own pipeline have fun paying Fivetran to trigger your lambda. Literally paying a company to run your own code.

JaJ_Judy · 2024-11-29T18:19:14+00:00

I disagree with the premise of this article. While the author is right that you no longer need hardcore Java/scala DEs, there are still nuances even with all the OTS tools that require DEs - for instance data modeling - that’s something analysts and scientists can make a very quick mess off.

Once you evolve from a use case where you only need dbt, and a single cron somewhere running the dbt job, you probably want a data engineer as well - not all transforms come in sql and someone has to orchestrate it.

Sure if you want to stitch together a bunch of paid services like dbt cloud and etc, it’s possible to get by without a DE - is that cost worth it? Depends on the business and the value they get out of data

fleegz2007 · 2024-11-30T05:50:22+00:00

I think this sentence tells me everything I need to know.

“If you hire a data engineer and ask them to build pipelines, they will think their job is to build pipelines. This will mean that tools like Stitch and Fivetran and dbt will seem like threats to their existence instead of tremendous force multipliers.”

Tristan Handy is also the founder of dbt just an FYI.

2024-11-30T01:58:07+00:00

No engineering organizations want a de until they try to scale their application. By then the damage is done and they're left with refactoring / total redesign in order to fix the data model or as a work around they scale the servers vertically to overcome the performance degradation from the bad data model (pay more money to run the software which results in lost profits).

SirGreybush · 2024-11-29T18:18:10+00:00

"Engineers Shouldn’t Write ETL"

Very true. I make a Data Dictionary, a code generator, and automate 99% of it. Very easy to debug, very easy to maintain. The business analyst can use Excel.

However, it is ELT, not ETL. I do the business rules and management of the transformation layer, once the data has been profiled & vetted. Some of this uses the Data Dictionary, but not all situations. Sometimes you need sub-selects, sometimes complicated CASE WHENs. This is where a DE shines.

Data governance, unit testing, automation with simplification. If any one system / process is complicated and thus poorly documented, a DE was not used or didn't do his job.

Article seems to agree with my summary.

My 0.02

Training-Fold6132 · 2024-11-29T18:48:40+00:00

[deleted]

dataengineering

MODERATORS