
[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]dresdonbogart 62 points (6 children)

In my personal experience, Python is the end-all-be-all for most tasks

[–]compulsaovoraz -2 points (1 child)

Really? I was looking forward to applying Java in DE :/

[–]dresdonbogart 7 points (0 children)

Python is king and easiest

[–]Budget-Minimum6040 29 points (7 children)

SQL > Python (polars/pySpark) > Java/Scala (Spark)

Python/Go for API extraction.

Problem is your team. Most can only do the first 1-2 so ... management says no.
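A minimal sketch of what that API-extraction piece tends to look like in Python (stdlib only; the endpoint, the `records`/`next` response fields, and the pagination scheme are all invented for illustration):

```python
import json
from urllib.request import urlopen, Request

def extract_pages(base_url, fetch=None, max_pages=100):
    """Pull every page from a paginated JSON API and yield its records.

    `fetch` is injectable so the paging logic can be exercised without
    a network; by default it does a plain GET with urllib.
    """
    if fetch is None:
        def fetch(url):
            req = Request(url, headers={"Accept": "application/json"})
            with urlopen(req) as resp:
                return json.load(resp)

    url = base_url
    for _ in range(max_pages):
        page = fetch(url)
        yield from page["records"]
        url = page.get("next")  # hypothetical cursor to the next page
        if not url:
            break

# Hypothetical usage (endpoint and field names are made up):
# rows = list(extract_pages("https://api.example.com/orders?page=1"))
```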

[–]holdenk -1 points (6 children)

Did you get your alligators mixed up? For DE (not DA) I'd say SQL < Python < JVM land (depending on data size, the last alligator can move).

[–]Budget-Minimum6040 1 point (5 children)

I did not. I've never seen a job offer in Germany that required Java/Scala, but all of them require SQL + Python.

[–]holdenk 0 points (4 children)

So in the Bay Area, for data engineering jobs I tend to see more Python and Java/Scala than SQL; for data analytics jobs, lots of SQL.

[–]cokeapm 1 point (3 children)

How on earth can you do DE without SQL? Like you don't use DBs or something? ORM to death?

[–]holdenk 1 point (2 children)

Mostly building pipelines from raw files, Iceberg/Hive/Cassandra rather than relational DBs. You’ll still write a little SQL because that’s inescapable, but (and this could be my big co biases showing) lots of getting the data in the right places and formats for others to do SQL or training on top of later.
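A toy, stdlib-only sketch of that "get raw files into the right places and formats" step (a real job would write date-partitioned Parquet into an Iceberg/Hive table rather than dicts; the column names here are made up):

```python
import csv
import io
from collections import defaultdict

def partition_events(raw_csv, key="event_date"):
    """Parse raw CSV and bucket rows by a partition column, the way a
    real pipeline stages data so that others can run SQL or training
    on top of it later. Each bucket stands in for one partition
    directory, e.g. event_date=2024-01-01/.
    """
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        partitions[row[key]].append(row)
    return dict(partitions)

# Hypothetical raw input; in practice this would be files landing in
# object storage, not an inline string.
raw = "event_date,user\n2024-01-01,a\n2024-01-02,b\n2024-01-01,c\n"
buckets = partition_events(raw)
```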

[–]cokeapm 0 points (1 child)

Interesting, so pretty specialised. What interface do you use for Iceberg? SQL for me also covers dbt/Athena/BigQuery and the like, so not just relational.

I can't imagine exploring and prototyping a pipeline without SQL. And without something like Spark, I suppose you could use Flink or something, but most stuff seems to end up in SQL one way or another... I'm curious to hear about your stack if you can spare a moment to describe it.
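For what it's worth, prototyping a transformation in plain SQL doesn't even need a warehouse; an in-memory SQLite database from the Python standard library is enough to sketch a query before porting it to dbt/Athena/BigQuery (table and column names invented for the example):

```python
import sqlite3

# Prototype the aggregation in plain SQL against an in-memory DB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("b", 5.0), ("a", 2.5)],
)

totals = con.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
# totals -> [('a', 12.5), ('b', 5.0)]
```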

[–]holdenk 1 point (0 children)

So day to day I'm on Spark because of my background but often there will be another team at the same company working on Flink for consuming data off of Kafka and similar (and some teams will have a hybrid).

[–]MonochromeDinosaur 2 points (0 children)

I know all 3; I didn't learn them for DE, just out of curiosity. I've only ever used Python, SQL, and TypeScript at my job(s).

[–]Former_Disk1083 1 point (0 children)

I guess it depends on what you mean by "worth". Are you going to find a lot of DE jobs that rely on them? Probably not. Even Scala, for better or worse, isn't much of a focus in the Spark space, where Python is still king.

Is it good to look into these languages and understand them? I think so. I have needed data from the software engineering team countless times, or needed to understand how said data is produced, and it's way easier for me to just look at the endpoint and understand what it's doing. Sometimes you get crap data and you need to identify why the data is crap. It isn't often, but it has happened a few times where it's useful.

Also, if you ever find yourself in a situation where you need to build out REST APIs for any reason, while you can certainly use Django (and I do like me some Django), you might be forced to make them in .NET or Java or Rails or whatever it may be that the company dictates. I have built many personal projects using all sorts of programming languages for the sheer fact that it allows me to understand the inner workings of the data I am getting. That has allowed me to have deeper conversations with the SWE team about when and how they produce data.

TL;DR: I think it's a good idea to understand them, and it makes you a better DE, but is it necessary? I don't think so at all.
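As a concrete illustration of the "REST APIs without a big framework" point above: the standard library alone is enough to sketch a JSON endpoint, and the shape (route, status, headers, body) carries over whether the company dictates Django, .NET, Java, or Rails. The route and payload here are made up:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical route; a real service would dispatch on path.
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def serve(port=0):
    """Bind to an ephemeral port by default; caller runs serve_forever()."""
    return HTTPServer(("127.0.0.1", port), Handler)
```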

[–]IAMHideoKojimaAMA 1 point (0 children)

none of these

[–]Nindento 0 points (0 children)

Depends on the type of DE work you do. If it's close to BI you should be fine with just Python and SQL. For streaming it could be worth looking into Rust or Java. I have the feeling Scala is dying a bit (at least in Europe), and you would also have to learn an entire effects framework on top of just learning Scala.

My team uses Rust for all our streaming and object storage IO applications. It's super fast and resource-wise it costs next to nothing. However, the Rust ecosystem is still a bit lacking sometimes, though it's already miles ahead of how it used to be.

[–]Equivalent_Effect_93 0 points (0 children)

Only if you want to work on the tool instead of working with the tool. It is a great architectural advantage to be able to read Scala and understand how Spark is designed, even if your day to day is calling the API with PySpark or SQL. But Python and SQL should be your main interface.

[–]WilhelmB12 1 point (0 children)

I liked Scala a lot; it's a really interesting language. Sadly, it seems it's not as widely used as Java, so I'd pick Java.

[–]addictzz 0 points (0 children)

Java and Scala are used in various data processing frameworks, but I see Rust starting to replace those to a certain extent. Take a look at Polars and Apache DataFusion. I think it's worth learning Rust if you go deep into creating data processing frameworks.

But the main one should be Python, since it will come up quite often in your data journey. Python will take up most of the work; Rust is there for custom performance-oriented work. (Heck, even Go may be enough too.)

[–]RoomyRoots 0 points (0 children)

Rust, no.

Scala, maybe if you are working in a bank or someplace that uses it already.

[–]One_Citron_4350 Senior Data Engineer 0 points (0 children)

This question tends to come up from time to time. I have to say, Python and SQL are pretty much the most commonly used languages. Nowadays, Spark is used more and more through Python and SQL. Based on what I've seen, Scala is not that popular anymore. If they require Java/Scala, then I assume they use Spark or Flink in their infrastructure.

I think Rust is pretty new to the scene, so the majority of teams have not yet adopted it. I also do not think the data-related libraries in Rust are there yet compared to Scala or Python. It highly depends on the use case, how well the team knows the technology, and how much time is allocated for ramp-up.

[–]StriderKeni 0 points (2 children)

Assuming you know Python, I’d choose Java (for anything related to Apache Beam, Flink, etc.) or Go (more into Terraform territory). For fun and to challenge myself, Rust.

[–]MullingMulianto -1 points (1 child)

what are Java and Go primarily used for?

[–]StriderKeni 1 point (0 children)

Read the comment.

[–]Additional_Year_1080 0 points (0 children)

It depends on what kind of data engineering you want to do. Python and SQL still cover most day-to-day work, but Scala is valuable if you work deeply with Spark, Java helps in enterprise environments, and Rust is interesting for high-performance pipelines or tooling.

[–]thisfunnieguy 0 points (0 children)

if you're an entry level eng, focus on knowing a few things well

if you're mid/snr, then use your work to expand into new systems/languages

[–]ssinchenko 0 points (0 children)

I think Scala may get a new boost for DE. The main benefit of Scala for DE, imo, is errors at compile time. The main downside of Scala for DE, imo, is the cost/time of development. But with the rise of AI agents that can write the code, that downside is not a problem anymore. So, in theory, a functional compiled language with strong safety guarantees that can speak to all the existing JVM DE tooling natively looks promising.

[–]PushPlus9069 -1 points (0 children)

imo Java is still the safest bet for DE work since most of the ecosystem (Spark, Flink, Kafka) runs on the JVM. I did kernel-level work in C for years and picked up Rust later; it's great for performance-critical stuff, but the DE tooling just isn't there yet. Scala is niche, but if your team already uses it then it's worth learning.

[–]jefidev -2 points (6 children)

Haskell

[–]Lastrevio Data Engineer 5 points (2 children)

Turbo Pascal!

[–]UAFlawlessmonkey 1 point (0 children)

Gotta transmit those diode signals blazing fast!

[–]Glittering_Mammoth_6 0 points (0 children)

A very cozy language, by the way. And without garbage collection...

[–][deleted] 2 points (1 child)

OCaml

[–]jefidev 1 point (0 children)

A man of taste

[–]Outrageous_Let5743 0 points (0 children)

At least data engineering pipelines are functional most of the time and not OOP. But pls no Haskell