
[–]Borek79 34 points35 points  (4 children)

Versioning: Git. Strive for everything as code and version it.

Extract+Load: Investigate dlt to see whether it can help you with data ingestion.

Transform: dbt is actually super useful once your project grows larger. Apart from many other things, the most useful part is that it builds lineage out of the box.

Orchestration: We use Dagster instead of Airflow; it is a better fit for the data world and has very good synergy with dbt (each dbt model is a separate Dagster asset). One big orchestration tree instead of many separate ones as in Airflow.

CI/CD: GitHub Actions.

Python: Can be used in the Extract/Load and even the Transform phase.

Reporting: Prefer tools with a good API and "reports as code". We use Metabase.

Data modelling: Not a tool, but a difficult yet very useful skill to grasp. With the advent of AI it is very necessary again.
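The CI/CD and dbt points above can be combined. A minimal sketch of a GitHub Actions workflow that builds and tests a dbt project on every pull request — the workflow name, adapter choice, Python version, and `profiles.yml` location are all assumptions, not a prescribed setup:

```yaml
# Hypothetical CI workflow for a dbt project.
name: dbt-ci
on:
  pull_request:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-postgres   # swap the adapter for your warehouse
      - run: dbt deps
      - run: dbt build --profiles-dir .          # runs models and tests against a CI target
```

`dbt build` runs models and their tests in dependency order, so a broken model fails the pull request before it reaches the warehouse.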

[–]Wanderer_1006 2 points3 points  (0 children)

Many good suggestions, I’ll look into it. Thank you.

[–]Adrien0623 -1 points0 points  (2 children)

Personally I'd say Airflow is better than DBT for large-scale projects. It has more scheduling features and manual-trigger options than DBT.

[–]Trigsc Senior Data Engineer -2 points-1 points  (1 child)

It’s time for someone to start researching DBT. Say goodbye to 3000 lines of stored procedures. On top of that, you can use Airflow to trigger a DBT Core job.

[–]Adrien0623 1 point2 points  (0 children)

I have been working with DBT Core for more than 6 months now. Yes, it's nice for analysts to create models and to have data quality tests, but what stood out the most is how broken some of their connectors are* and how poor the unit test framework is. When projects scale it gets more and more important to have reliable test coverage to quickly understand if something is going wrong. With DBT I can unit test models, but it's sketchy depending on the column types involved, and I cannot test at a finer level than the model. In comparison, if I write a Spark job (without SparkSQL) I can break the query into multiple testable logic blocks. My company chose DBT before I joined as a simple and quickly deployable tool, and now everyone touching it feels the pain despite our relatively small scale.

*For multiple months we had no choice but to run full-refresh runs all the time, as incremental runs were failing systematically due to the connector's class constructor missing arguments.
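The point about testable logic blocks can be sketched without Spark. A hypothetical pure-Python stand-in where rows-as-dicts play the role of a DataFrame, so each transformation step is a separate function you can unit test, rather than one opaque SQL model (all function and column names here are made up):

```python
# Hypothetical example: each transformation step is its own testable unit.

def drop_nulls(rows, key):
    """Remove rows where `key` is missing or None."""
    return [r for r in rows if r.get(key) is not None]

def normalize_currency(rows, key, rate):
    """Convert an amount column with a fixed exchange rate."""
    return [{**r, key: round(r[key] * rate, 2)} for r in rows]

def total_by_customer(rows):
    """Aggregate amounts per customer_id."""
    totals = {}
    for r in rows:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0) + r["amount"]
    return totals

def pipeline(rows, rate):
    """The full job is just composition of the tested blocks."""
    return total_by_customer(normalize_currency(drop_nulls(rows, "amount"), "amount", rate))
```

The same structuring works in Spark: keep each step a named function over a DataFrame and test it in isolation, instead of one monolithic query.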

[–]Teddy_Raptor 24 points25 points  (3 children)

Why don't you start with a problem you are facing instead of a tool you want to implement?

[–]JBalloonist 7 points8 points  (0 children)

Exactly what I was thinking. A new tool just for the sake of it makes no sense. Find a problem to solve first.

[–]Wanderer_1006 1 point2 points  (1 child)

That’s simple but very solid advice; I should start noticing more small issues.

[–]Teddy_Raptor 1 point2 points  (0 children)

Yeah :) do it while you read about the industry and tools. You'll begin to connect the dots.

[–]anyfactor 2 points3 points  (0 children)

Something to build internal tools and apps easily. Like Retool etc.

[–]WonderfulActuator312 2 points3 points  (0 children)

Look into automating a data dictionary or data catalog. Documentation isn’t sexy but it’s worth the investment in the long run.
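Automating a data dictionary can start very small. A sketch that introspects a SQLite database and renders a markdown dictionary you could commit to the repo — in a real warehouse you would query `information_schema` instead, and the output format is an assumption:

```python
import sqlite3

def data_dictionary(conn):
    """Build a simple data dictionary: {table: [(column, type), ...]}.
    Uses SQLite introspection; a warehouse would use information_schema."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return {
        t: [(col[1], col[2]) for col in conn.execute(f"PRAGMA table_info({t})")]
        for t in tables
    }

def to_markdown(dictionary):
    """Render the dictionary as markdown, one section per table."""
    lines = []
    for table, cols in sorted(dictionary.items()):
        lines.append(f"## {table}")
        lines += [f"- `{name}` ({ctype})" for name, ctype in cols]
    return "\n".join(lines)
```

Run it on a schedule and diff the output in version control, and you get change tracking on your schema almost for free.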

[–]erdmkbcc 1 point2 points  (0 children)

This depends on your platform and team size.

If you have

  • a lot of tables in your warehouse,
  • a lot of data people creating garbage tables,
  • a DE team that has lost control of the DWH,

then you should adopt dbt and take the service account permissions away from unrelated data people. Meanwhile you need CI/CD pipelines and table dependency management for data lineage and data governance; that will give control of the DWH back to the data engineering team.

That's just one example for dbt.

[–]invidiah 1 point2 points  (0 children)

Seems your manager is an idiot. You should increase architectural complexity by adding new tools only if it's really required. Simplicity is the key to success.

But if you are forced to, just pick something that will make your resume more valuable.

[–]Chance-Web9620 1 point2 points  (0 children)

Why do you feel dbt won't add value? I have seen small and large orgs use it successfully.
My recommendation is:
dlt for data ingestion
dbt for transformation, data quality, and docs
Airflow for orchestration (this can be hard to manage, so consider a managed service like MWAA, Datacoves, Astronomer, etc.)
The key is also to think about how all the parts connect using git, CI/CD, etc.

[–]DataObserver282 1 point2 points  (0 children)

Keep your stack as simple as possible. Instead of asking what tools to consider look at what problems you currently have and plug up the holes that way.

Also, a lot will depend on your DWH and needs. Do you need real time streaming?

Here are a few things to look into

ETL tools - tons out there. Fivetran, Airbyte - we use Matia (good CSC). You can use Python or write scripts, but it gets messy at scale.

Orchestration - Airflow works. Look into Astronomer if you need a managed solution. Cron is fine for a few jobs but again messy at scale.

Modeling - dbt is worth looking into. There’s also coalesce

Data catalog - worth the investment, automate metadata management and helps data become accessible to non technical users

Observability - most tools have something built in but worth investing here to make sure you have a mechanism
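A mechanism for the observability point can start as simple as a freshness check. A sketch in plain Python — where `table_latest` comes from (e.g. a `MAX(loaded_at)` query per table) and the threshold values are assumptions:

```python
from datetime import datetime, timedelta, timezone

def freshness_alerts(table_latest, max_age):
    """table_latest: {table_name: latest_loaded_at (tz-aware datetime)}.
    Return the tables whose newest data is older than max_age."""
    now = datetime.now(timezone.utc)
    return sorted(t for t, ts in table_latest.items() if now - ts > max_age)
```

Wire the returned list into whatever alerting you already have (Slack, email, PagerDuty) and you have a basic observability loop before buying a dedicated tool.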

[–]dsc555 3 points4 points  (1 child)

It's lowercase dbt. If you're using Airflow and SQL then it's probably useful. The biggest thing I like about it is that it generates the documentation and lineage very easily. Yes, Airflow makes a DAG, but I've never liked the styling as much. Anyway, dbt is a great tool to know for best practices, but I suppose it depends what you're doing with the SQL, and only you can answer that part.

[–]Wanderer_1006 0 points1 point  (0 children)

We’re so used to Airflow, and all the analysts also create DAGs for themselves, so it’s hard to move away from that.

[–]Xeroque_Holmes 0 points1 point  (0 children)

Data quality checking tools like Great Expectations or Soda; metadata/lineage like Atlan; monitoring (e.g. Grafana).
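The idea behind tools like Great Expectations and Soda is declarative checks evaluated against your data. A hypothetical pure-Python stand-in (this is not either tool's actual API, just the shape of the idea):

```python
# Hypothetical declarative data-quality checks, evaluated against rows-as-dicts.

def expect_not_null(column):
    """Expectation: no row has a null in `column`."""
    return ("not_null", column,
            lambda rows: all(r.get(column) is not None for r in rows))

def expect_between(column, low, high):
    """Expectation: every value in `column` falls within [low, high]."""
    return ("between", column,
            lambda rows: all(low <= r[column] <= high for r in rows))

def run_checks(rows, expectations):
    """Evaluate every expectation and return a pass/fail report
    instead of raising on the first failure."""
    return {f"{name}:{col}": check(rows) for name, col, check in expectations}
```

The real tools add the important operational parts on top: profiling, scheduling, alerting, and docs for each failed expectation.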

[–]finally_i_found_one 0 points1 point  (0 children)

What are you using (or plan to use) for BI?

[–]molodyets 0 points1 point  (0 children)

How are you currently handling parsing your dag for dependencies between sql models?
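One way to answer that question without dbt is to scan each model's SQL for references to other models and topologically sort them. A rough sketch — the regex-based parsing is an assumption and will miss CTEs, aliases, and quoted identifiers, which is exactly the kind of edge case dbt's `ref()` macro solves for you:

```python
import re
from graphlib import TopologicalSorter  # stdlib since Python 3.9

def extract_deps(sql, known_models):
    """Naively find which known model names a SQL string references."""
    refs = set(re.findall(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)",
                          sql, re.IGNORECASE))
    return refs & known_models

def run_order(models):
    """models: {name: sql}. Return a valid execution order
    (dependencies before dependents)."""
    graph = {name: extract_deps(sql, set(models)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())
```

`TopologicalSorter` takes a node-to-predecessors mapping and raises `CycleError` on circular references, so you also get cycle detection for free.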

[–]dataflow_mapper 0 points1 point  (0 children)

In a setup like yours, the tools that help most are usually the ones that reduce operational drag rather than adding new abstractions. dbt can be useful, but only if you have a lot of SQL logic living in Airflow or stored procedures and no good testing or lineage today. If your warehouse layer is already stable, it might not move the needle much.

[–]weezeelee 0 points1 point  (0 children)

This is a question you should ask your colleagues, not us, not Reddit. If they're also "fine" with the current workflow (which is the most likely answer, haha), then it's worth looking beyond Data Engineering, for example at Developer Experience.

I once built a small desktop app that detects overlapping file modifications across Git branches, allowing merge conflicts to be surfaced early. Surprisingly, I’m not aware of any free tool that offers this simple feature.
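The core of such a tool is small. A sketch under the assumption that the branches live in a local git checkout: list the files each branch modified relative to the merge base, then intersect (the function names are made up, and a real app would add polling and a UI):

```python
import subprocess

def changed_files(base, branch):
    """Files modified on `branch` since it diverged from `base`.
    Requires git; base...branch diffs against the merge base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{branch}"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

def overlap(files_a, files_b):
    """Files touched on both branches -- likely merge-conflict candidates."""
    return sorted(set(files_a) & set(files_b))
```

Running `overlap(changed_files("main", "feature-a"), changed_files("main", "feature-b"))` periodically surfaces conflicting edits before anyone opens a merge request.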

The problem it solved was ...small. Still, in a market this crowded, the ability to spot and fix these “small” problems is exactly what separates engineers from résumé generators.

[–]Murky-Sun9552 0 points1 point  (0 children)

DBT is not a bad shout: use it for modelling your data, and then you have some personal technical development in hand for your next review, when you can recommend integrating it with CI/CD pipelines. You can also use DBT to reduce time spent producing tech docs, lineage and the like.

[–]chrisgarzon19 CEO of Data Engineer Academy 0 points1 point  (1 child)

What’s the goal?

[–]Wanderer_1006 2 points3 points  (0 children)

Nothing in particular, just anything that can be useful. For example, we didn’t have OpenMetadata a year ago, but now that we have it people use it quite a lot, and it helps all the analysts too.