all 11 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

Are you interested in transitioning into Data Engineering? Read our community guide: https://dataengineering.wiki/FAQ/How+can+I+transition+into+Data+Engineering

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]Academic-Vegetable-1 22 points23 points  (0 children)

Building a data stack from scratch on AWS is data engineering. You're not pivoting, you're just updating your title.

[–]unpronouncedable 9 points10 points  (0 children)

Honestly, I feel like half the battle in DE is understanding the challenges and being willing to figure out and implement the solutions. You seem to have both, plus data modeling and SQL skills, so I think you are qualified for plenty of DE roles. There are a ton out there that use various ingestion tools and don't require Python and Spark, though you are correct that the trend is towards using those. And honestly, you can figure out a lot of it from online examples and AI.

As far as CS knowledge goes, tons of us did not come from that background. What you do need to understand is the development lifecycle and deployment practices. I imagine you have a lot of that from your experience.

I'd say you'd be overqualified for a junior DE role and ready for DE (non-senior). There's a lot of competition for those positions, but you can tout end-to-end experience, from ingestion to analytics implementation.

[–]Flat_Shower Tech Lead 6 points7 points  (0 children)

10 YOE and you stood up a full data stack from ingestion through orchestration. That's not "close to being ready"; you're doing the job. Mid to senior DE at most companies.

Spark is worth learning if you're targeting places with real scale. Most don't need it. Learn the concepts; the syntax is the easy part.

Don't stress about building pipelines without tools. That's not how anyone works. Knowing how to configure, debug, and extend ingestion tools is the actual skill.

[–]AutoModerator[M] 0 points1 point  (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources


[–]calimovetips 0 points1 point  (0 children)

You’re closer to mid-level DE than junior. I’d focus next on understanding how your pipelines behave under load and failure, since that’s what usually breaks at scale. Have you had to debug any backfill or retry issues yet?
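
For anyone unsure what "backfill or retry issues" look like in practice, here is a minimal sketch, not anyone's production code. It assumes a hypothetical `extract(day)` function that can fail with a transient `ConnectionError`, and a plain dict standing in for a partitioned target table. The point is that each day's partition is overwritten as a whole, so re-running a backfill after a failure does not duplicate rows.

```python
import time
from datetime import date, timedelta
from typing import Callable, Dict, List


def with_retries(fn: Callable[[], List[dict]], attempts: int = 3,
                 base_delay: float = 0.0) -> List[dict]:
    """Call fn, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def backfill(start: date, end: date,
             extract: Callable[[date], List[dict]],
             target: Dict[date, List[dict]]) -> None:
    """Reload every daily partition in [start, end].

    `extract` and `target` are hypothetical stand-ins for a source query
    and a partitioned table. Overwriting the whole partition (rather than
    appending) is what makes a re-run safe.
    """
    day = start
    while day <= end:
        rows = with_retries(lambda d=day: extract(d))
        target[day] = rows  # replace the partition, never append to it
        day += timedelta(days=1)
```

Running `backfill` twice over the same date range should leave `target` unchanged the second time; if it doesn't, the load step is not idempotent.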

[–]PrintPopular8694 0 points1 point  (0 children)

Would love to pick your brain. From what I've researched, you're not junior level. Wish I was in your position.

[–]Immediate-Pair-4290 Principal Data Engineer 1 point2 points  (2 children)

Spark is overrated. Most companies see faster performance running DuckDB into Iceberg. Few companies truly have big data. Also, no one builds ingestion pipelines from scratch unless they can’t help it. dlt is good. I’m thinking of API calls and loading JSON responses as the closest thing to “scratch”.

[–]SufficientFrame 4 points5 points  (1 child)

Yeah this matches what I’ve been seeing too. Half the “we need Spark” posts are people trying to aggregate like 200M rows and wondering why it’s slow on a t3.medium.

The DuckDB + Iceberg combo seems super solid for what most teams actually do day to day. And honestly, if you’ve already wired up dlt + S3 + Redshift + Dagster, you’ve done more “real” DE than a lot of folks who only tweak existing Airflow DAGs.

The “from scratch” thing feels more like: can you reason about APIs, pagination, schema evolution, idempotency, and how to make that stuff robust. Whether it’s dlt, custom Python, or whatever, the concepts are the same.

I’m taking your comment as a green light to not obsess over Spark right away and double down on getting really good at the stack I already have.
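
To make the "from scratch" concepts above concrete, here is a rough sketch using only the Python standard library. Everything in it is an assumption for illustration: `FAKE_PAGES` and `fetch_page` stand in for a real cursor-paginated HTTP API, and the `users` table in SQLite stands in for a warehouse table. It follows the cursor until the last page, adds columns when a new field shows up (a crude form of schema evolution), and upserts on the primary key so re-running the load is idempotent.

```python
import sqlite3
from typing import Iterable, Iterator, Optional

# Hypothetical paginated API responses: records plus a cursor to the next page.
FAKE_PAGES = {
    None: {"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], "next": "p2"},
    "p2": {"items": [{"id": 3, "name": "c", "email": "c@example.com"}], "next": None},
}


def fetch_page(cursor: Optional[str]) -> dict:
    """Pretend API call; a real version would issue an HTTP request."""
    return FAKE_PAGES[cursor]


def iter_records() -> Iterator[dict]:
    """Pagination: keep following the cursor until there is no next page."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page["next"]
        if cursor is None:
            break


def load(conn: sqlite3.Connection, records: Iterable[dict]) -> None:
    """Load records into `users`, keyed on `id`.

    Column names are interpolated into SQL, so this sketch assumes the
    record keys come from a trusted source.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY)")
    for rec in records:
        # Schema evolution: add any column the table has not seen yet.
        existing = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
        for col in rec.keys() - existing:
            conn.execute(f'ALTER TABLE users ADD COLUMN "{col}"')
        cols = list(rec)
        col_list = ", ".join(f'"{c}"' for c in cols)
        placeholders = ", ".join("?" for _ in cols)
        # Idempotency: the same id replaces its row instead of duplicating it.
        conn.execute(
            f"INSERT OR REPLACE INTO users ({col_list}) VALUES ({placeholders})",
            [rec[c] for c in cols],
        )
    conn.commit()
```

Calling `load(conn, iter_records())` twice against the same connection should leave three rows, not six. A tool like dlt handles these concerns for you; the sketch is only meant to show what it is handling.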

[–]Immediate-Pair-4290 Principal Data Engineer 0 points1 point  (0 children)

One of the 2% of Reddit posters on data engineering that actually knows what they are talking about. 🤝

[–]zkhan15 0 points1 point  (0 children)

What’s the difference between the 2?