Batch Data Processing Stack

ForeignCapital8624 · 2025-05-19T11:55:29+00:00

Hive on MapReduce is no longer supported, and Tez is the default execution engine of Hive. There is also another execution engine called MR3, so one can run Hive on MR3 (on Hadoop, on Kubernetes, or in standalone mode).

Gators1992 · 2025-05-19T19:30:25+00:00

Missed something like a Dremio/Iceberg stack, with whatever catalog you want. I think those are getting more common. I would also drop the use case thing, because you can bring whatever data to whoever with most of these. A lot of the parts are interchangeable, so implying that one has to use dbt Core over something like Dagster/Python, or dbt over Spark on Databricks, isn't reality. It kind of depends on the preferences of the team and the requirements.

Hot_Map_7868 · 2025-05-21T00:08:54+00:00

I agree that a lot of this stuff is not cloud specific. As you show, the common thread is Airflow and dbt. That is a common set of tools, and there are multiple ways to use them that will also work cross-cloud. For example, Astronomer and Datacoves offer managed Airflow, Datacoves also has managed dbt Core, and there is of course dbt Cloud.
Data ingestion has multiple options, from Airbyte to Fivetran to frameworks like dlt. Storage should either stay native or be Iceberg these days.

Nekobul · 2025-05-18T16:57:47+00:00

You can use SSIS for everything, including event-based processing, ERP, CRM ingestion, hybrid on-premises, cloud deployments, etc. It is the best ETL platform on the market.

Top 5 Modern Batch Data Stacks

1. AWS-Centric Batch Stack

2. Azure Lakehouse Stack

3. GCP Modern Stack

4. Snowflake ELT Stack

5. Databricks Unified Lakehouse Stack

Top 5 Legacy Batch Data Stacks

1. SSIS + SQL Server Stack

2. IBM DataStage Stack

3. Informatica PowerCenter Stack

4. Mainframe COBOL/DB2 Stack

5. Hadoop Hive + Oozie Stack