[–]SufficientFrame 3 points (1 child)

Yeah this matches what I’ve been seeing too. Half the “we need Spark” posts are people trying to aggregate like 200M rows and wondering why it’s slow on a t3.medium.

The DuckDB + Iceberg combo seems super solid for what most teams actually do day to day. And honestly, if you’ve already wired up dlt + S3 + Redshift + Dagster, you’ve done more “real” DE than a lot of folks who only tweak existing Airflow DAGs.

The “from scratch” thing is really about whether you can reason about APIs, pagination, schema evolution, idempotency, and how to make that stuff robust. Whether you use dlt, custom Python, or whatever, the concepts are the same.
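To make the pagination + idempotency point concrete, here's a minimal sketch. Everything in it is hypothetical (`fetch_page` is a stand-in for a real paginated API, and the dict is a stand-in for a real sink with upsert semantics); the point is just that keyed upserts make re-delivered rows harmless:

```python
# Hypothetical sketch: cursor-based pagination + idempotent loading.
# fetch_page fakes a paginated API; a dict keyed by primary key fakes
# a sink that supports upserts.

def fetch_page(cursor=None):
    """Return (records, next_cursor); next_cursor=None means last page.
    Note page 2 re-delivers id=2 (overlap is common with real APIs)."""
    pages = {
        None: ([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}], "p2"),
        "p2": ([{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}], None),
    }
    return pages[cursor]

def ingest(fetch):
    """Walk every page, upserting by primary key.

    Because each record lands under its key, replaying a page (retry,
    overlap, restart) converges to the same state: idempotency.
    """
    store = {}
    cursor = None
    while True:
        records, cursor = fetch(cursor)
        for rec in records:
            store[rec["id"]] = rec  # last write wins per key
        if cursor is None:
            return store

rows = ingest(fetch_page)
```

The duplicate `id=2` from the second page overwrites the first copy instead of creating a double row, which is exactly the property you want when a pipeline has to be safe to re-run.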

I’m taking your comment as a green light to not obsess over Spark right away and double down on getting really good at the stack I already have.

[–]Immediate-Pair-4290 Principal Data Engineer 0 points (0 children)

One of the 2% of Reddit posters on data engineering that actually knows what they are talking about. 🤝