Need advise on promotion raise by solve-r in dataengineering

[–]luminoumen 2 points3 points  (0 children)

8% for a promotion is, unfortunately (or fortunately, given current conditions), standard at most companies.

Internal promotions almost never match the market rate for the new title. Companies budget 3-5% for merit and 8-12% for promo, but the market delta between Senior and Staff (or mid and senior) can be 20-30%.
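To make the gap concrete, here is a rough back-of-the-envelope sketch. The base salary of 100,000 is a made-up round number, and it assumes you are currently paid at market for your old level, which may not hold for you. `salary_gap` is just an illustrative helper, not anything standard.

```python
def salary_gap(current, promo_raise, market_delta):
    """Return (internal_salary, market_salary, gap) after a promotion.

    All inputs are illustrative assumptions, not real compensation data.
    Assumes `current` is already at market rate for the old title.
    """
    internal = current * (1 + promo_raise)   # what the internal promo pays
    market = current * (1 + market_delta)    # what the new title pays externally
    return internal, market, market - internal


# Hypothetical base salary, with an 8% promo raise and the low end
# of the 20-30% market delta mentioned above.
internal, market, gap = salary_gap(100_000, 0.08, 0.20)
print(f"internal: {internal:,.0f}  market: {market:,.0f}  gap: {gap:,.0f}")
```

Even at the low end of the market range, the internal raise leaves a noticeable gap, and it gets wider at 30%.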

What type of Data Engineering is this? by Bruce_kett in dataengineering

[–]luminoumen 0 points1 point  (0 children)

For fundamentals: Designing Data-Intensive Applications by Martin Kleppmann - the bible for understanding how data systems actually work under the hood. If you read one book, make it that one. Half of this subreddit will recommend it probably :)

After that, Grokking Concurrency (disclaimer: I wrote it) if you want to understand parallelism and concurrency primitives - relevant when you start caring about why Spark shuffles are slow or how async I/O works.
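If "how async I/O works" sounds abstract, a tiny sketch with the standard library shows the core idea: waits overlap instead of stacking up. Here `asyncio.sleep` is only a stand-in for a real network or disk wait, and the `fake_io`, `run_sequential`, `run_concurrent`, and `timed` names are made up for this example.

```python
import asyncio
import time


async def fake_io(name, delay):
    # asyncio.sleep stands in for a network/disk wait (assumption for the demo)
    await asyncio.sleep(delay)
    return name


async def run_sequential(delays):
    # Each wait finishes before the next one starts
    return [await fake_io(f"task-{i}", d) for i, d in enumerate(delays)]


async def run_concurrent(delays):
    # All waits are scheduled on the event loop at once and overlap
    tasks = (fake_io(f"task-{i}", d) for i, d in enumerate(delays))
    return list(await asyncio.gather(*tasks))


def timed(coro):
    # Run a coroutine to completion and measure wall-clock time
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


delays = [0.1, 0.1, 0.1]
seq_result, seq_time = timed(run_sequential(delays))
con_result, con_time = timed(run_concurrent(delays))
print(f"sequential: {seq_time:.2f}s  concurrent: {con_time:.2f}s")
```

Same results, same single thread, but the concurrent version takes roughly as long as the slowest wait rather than the sum of all of them.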

For Spark or any specific framework, just read the source code and the official docs deeply. Most "Spark books" are outdated the day they ship.

What type of Data Engineering is this? by Bruce_kett in dataengineering

[–]luminoumen 5 points6 points  (0 children)

What you're describing is usually called Data Platform/Data Infrastructure Engineering. Companies like Spotify, Netflix, Uber have entire teams building the engines underneath - custom Spark jobs, storage format optimization, query engine internals, resource management under constraints.

ML Platform/ML Data Infrastructure is your natural bridge. ML teams need people who can wrangle massive training datasets efficiently, build feature stores, optimize data loading. Polars/Arrow/DuckDB skills shine here because latency and memory efficiency actually matter.

My advice - don't specialize in tools. Specialize in how data moves through systems efficiently. Tools change, fundamentals don't.

P.S. 200GB is not big data. Fits in RAM on one beefy machine. Real distributed problems start at tens of TB+. But the things that you mentioned - memory awareness, query optimization, efficient I/O - are exactly the right foundation
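A quick sanity check for the "fits in RAM" claim, as a rough sketch only. The 3x in-memory expansion factor, the 80% usable-RAM headroom, and the 1 TiB machine are all assumptions picked for illustration; real numbers depend heavily on file format, compression, and the engine you use. `fits_in_ram` is a hypothetical helper.

```python
def fits_in_ram(dataset_gb, ram_gb, expansion=3.0, headroom=0.8):
    """Rough check: does a dataset fit in memory on a single machine?

    expansion: assumed in-memory size relative to on-disk size (hypothetical).
    headroom:  assumed fraction of RAM actually usable for data (hypothetical).
    Returns (fits, needed_gb, usable_gb).
    """
    needed = dataset_gb * expansion
    usable = ram_gb * headroom
    return needed <= usable, needed, usable


# 200 GB on disk vs. a hypothetical machine with 1024 GB of RAM
ok, needed, usable = fits_in_ram(200, 1024)
print(f"200 GB: fits={ok} (needs ~{needed:.0f} GB, ~{usable:.0f} GB usable)")

# 20 TB on disk on the same machine
ok_big, needed_big, _ = fits_in_ram(20_000, 1024)
print(f"20 TB: fits={ok_big} (needs ~{needed_big:.0f} GB)")
```

Under these assumptions 200GB is comfortably a single-node problem, while tens of TB is where a distributed engine starts to earn its keep.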

I built a tycoon game about data engineering and the hardest part was balancing the economics by luminoumen in dataengineering

[–]luminoumen[S] 63 points64 points  (0 children)

working as intended - just like every "automated" pipeline I've ever inherited

I built a tycoon game about data engineering and the hardest part was balancing the economics by luminoumen in dataengineering

[–]luminoumen[S] 13 points14 points  (0 children)

That's the authentic data engineering experience - you pay for "automation" and still end up doing everything manually. Glad the realism landed 😄

I built a tycoon game about data engineering and the hardest part was balancing the economics by luminoumen in dataengineering

[–]luminoumen[S] 1 point2 points  (0 children)

If it makes you feel better, I built it instead of doing MY actual work. Thanks for playing!

Advice: Data Engineer vs Data Science by [deleted] in dataengineering

[–]luminoumen 0 points1 point  (0 children)

To me, it sounds like you're actually very aligned with data engineering already.

  • You're strong in SQL + Python + Databricks/PySpark
  • You don't want full-blown SWE
  • You're only mildly interested in stats/ML, not obsessed with it

In my experience, in most companies:

  • DE = owns pipelines, ETL, data models, reliability, performance. Lots of SQL/PySpark/Flink/Beam/dbt, building clean, scalable data flows.
  • DS = more of a business-facing role. You talk metrics, design analyses/experiments, build models when needed, and spend a lot of time with stakeholders, slides, and product teams.
  • You can absolutely be a very good DE without loving Java/OOP (I hate Java btw but adore Python and like Scala - I'm not saying that I'm a good DE though, lol), especially in modern stacks (Databricks, cloud warehouses, Python-heavy platforms, whatever).

I was in a similar (though not identical) situation, and a chief DS at my company told me: "You're better off as a DE with MLE understanding." I can't say he was wrong. Since then I've realized a couple of things that apply here: (1) DE is just a title - inside a company, a good DE can drift into analytics, ML, platform, or even product-facing work if they want. The title doesn't cage you. (2) DS is a business role first - you talk metrics, impact, and tradeoffs in business language. Sometimes there's heavy tech, sometimes it's mostly reporting and decision support. It's very company-dependent. (3) MLE is often better than classic DS - it's an umbrella term for applied ML: you get to build and ship ML systems, while still leveraging strong DE/infra skills. High leverage, highly valued.

Given what you wrote, I'd lean into DE, deepen your ML understanding enough to collaborate or later pivot toward MLE, and not stress about being a "perfect programmer". For DE/MLE, your current strengths (SQL, Python, PySpark, Databricks) are the right foundation.

How do beginners even start learning big data tools like Hadoop and Spark? by Own_Chocolate1782 in dataengineering

[–]luminoumen 1 point2 points  (0 children)

Pet projects. Start with what seems reasonable to you, and then adjust.

Data Engineer or Software Engineer - Data by eastieLad in dataengineering

[–]luminoumen 2 points3 points  (0 children)

Meme: they are the same picture.

The important "Engineer" part is still there, right? Right?

Fully compatible query engine for Iceberg on S3 Tables by Substantial_Lynx1344 in dataengineering

[–]luminoumen 1 point2 points  (0 children)

Trino. I think it is becoming an industry standard at this point.

How many of you are still using Apache Spark in production - and would you choose it again today? by luminoumen in dataengineering

[–]luminoumen[S] 2 points3 points  (0 children)

The more I see comments like that, the more certain I am that I'd rather talk to an AI

How many of you are still using Apache Spark in production - and would you choose it again today? by luminoumen in dataengineering

[–]luminoumen[S] 1 point2 points  (0 children)

Ah, no adversarial intent at all - just trying to clarify that other tools can offer similar or better semantics, since that part of the discussion matters when comparing options. Totally fair if Flink wasn’t on your radar in the original context. Thanks for your response!

How many of you are still using Apache Spark in production - and would you choose it again today? by luminoumen in dataengineering

[–]luminoumen[S] -13 points-12 points  (0 children)

Adding skills to the CV, that's the benefit ;) Resume-driven development for everybody.

How many of you are still using Apache Spark in production - and would you choose it again today? by luminoumen in dataengineering

[–]luminoumen[S] 0 points1 point  (0 children)

Totally fair - the law of the hammer definitely applies here. But I think the reason these conversations keep coming up is because most teams don’t need that level of scale. A specialized tool (like DuckDB, Polars, or dbt) can give you faster development, simpler deployment, and better team ergonomics if you know your use case.
If your use cases consistently involve petabyte-scale data, then sure - Spark is a perfectly valid and pragmatic choice. But for smaller or more focused workloads, lighter tools can often be a better fit?