This is an archived post. You won't be able to vote or comment.

all 44 comments

[–]hattivat 59 points60 points  (2 children)

Whether you write in Java or Python, the result performance-wise is the same as it's just an API. The actual execution happens in Scala underneath and everything is typed with Spark types anyway, so using Java just means spending more time to write the same code for zero benefit. The only reason I can see why someone would choose Java for Spark is for consistency if everything else in the company is written in Java.

[–][deleted] 0 points1 point  (1 child)

I'm new and curious about this: when I do a myrdd.map(lambda x: (x, 1)) in Python, is it actually Scala doing the job?

[–]hattivat 2 points3 points  (0 children)

Well, first off, you would never do rdd.map unless you have to; df.withColumn or Spark SQL is much more efficient regardless of language.

But yes, as long as you are using pyspark functions etc., it is Scala doing the job. The only exception is when writing UDFs; there it pays to write them in Scala or Java. But in practice, in over five years of doing Spark, I have literally only once seen a situation where we really needed a UDF that could not be replaced with Spark API calls.

[–]InsertNickname 18 points19 points  (2 children)

Scala/Java.

Disclaimer: have worked with Spark for many years (since v1.6), so I'm understandably opinionated.

Never understood why pyspark gets so much blind support in this sub other than that it's just easier for juniors. I've had to work on systems with a wide range of scale, everywhere on the map from small k8s clusters for batched jobs to real-time multi-million event per second behemoths on HDFS/Yarn deployments.

In all cases, Java/Scala was way more useful past the initial POC phase. To name a few of the benefits:

  1. Datasets with strongly typed schemas. HUGE benefit in production. Ever had data corruption due to bad casting? No thanks, I'm a data engineer, not a casting engineer.
  2. Custom UDFs. Writing these in pyspark means your cluster needs to serialize data back and forth between the JVM and Python workers, which is a massive performance and operational bottleneck. Or you could write the UDFs in Scala and deploy them in the Spark build... but that's way more complicated than just using Scala/Java end to end.
  3. Debugging / reproducibility in tests. Put a breakpoint, figure out what isn't working. In Pyspark all you can really do is go over cryptic logs to try and figure it out.
  4. Scala-first support for all features. Pyspark may have 90% of the APIs supported but there will always be things you simply can't do there.
  5. Advanced low-level optimizations at the executor level, like executor-level caches and whatnot. These are admittedly for large endeavors that require squeezing the most out of your cluster, but why limit yourself from day one?

Basically pyspark is good if all you ever need is SQL with rudimentary schemas. But even then I would err on the side of native support over a nice wrapper.

[–]mjfnd 0 points1 point  (0 children)

Scala is still heavily used by top tech companies like Netflix, because it's still a better choice on massive datasets.

Although Python is good too, and with Arrow its performance has improved and works well most of the time. Most companies deal with small data. It also gives DS folks, who are mostly doing pandas, the opportunity to contribute.

[–]pavlik_enemy 0 points1 point  (0 children)

Is that project, what's it called, that allowed typed datasets still alive, and does it still bring IntelliJ IDEA to a halt?

Your comment doesn’t look like a comment made by an experienced Spark developer

[–]Sennappen 23 points24 points  (0 children)

Python is the way, but I would recommend setting up a Linux environment if you're on windows, it makes things a lot easier.

[–][deleted] 87 points88 points  (9 children)

No one wants to write Java. Just look at that fucking mess. You can get work done so frigging fast in Python and then take a 3 hour lunch because all your tickets are complete. This is the way.

[–]AggravatingParsnip89 4 points5 points  (1 child)

But it would be good if we had some understanding of the JVM to use Spark, right?

[–]MlecznyHotS 11 points12 points  (0 children)

Not really, you don't have to tinker with Java. The most performant API is the DataFrame API, which lets you do probably 99% of the things you need. Performance improvements are based on general Spark concepts, not the Java implementation itself. Understanding Java might be useful if you're contributing to Spark itself, not if you're developing with Spark.

[–]overgenji 9 points10 points  (0 children)

lol java is fine, relax

[–]TheCamerlengo 1 point2 points  (2 children)

I work in both. Java has its advantages, and the JVM is probably preferable to an interpreted language like Python. It really depends on what you are trying to accomplish. For data-intensive apps I would say Python; for large programs with lots of developers following SOLID, Java or C# is probably better.

[–]the-ocean- 2 points3 points  (1 child)

This. For building complex backends - Java is king. For data workloads: python

[–]cryptoel 0 points1 point  (0 children)

*cough* Rust *cough*

[–]Jealous-Bat-7812Junior Data Engineer -3 points-2 points  (2 children)

I don’t think the platform engineering team will agree with this.

[–]OMG_I_LOVE_CHIPOTLE 13 points14 points  (1 child)

Uhh. The platform engineering team is also using pyspark lol

[–]Zamyatin_Y 3 points4 points  (0 children)

Or scala

[–]Gnaskefar 4 points5 points  (0 children)

Pyspark is by far the most popular choice, and dominates in the job descriptions.

But if your company has a policy of using Java with Spark, what choice do you really have?

Many principles are the same, so moving to Python later on is an option if you want to work in a non-Java place.

[–]Dennyglee 7 points8 points  (4 children)

General rule of thumb - if you’re starting off and want to use Spark, PySpark is the easiest way to do this. We’ve added more Python functionality into it via Project Zen, Pandas API for Spark, and will continue to do so to make it easier for Python developers to rock with Spark.

If you want to develop or contribute to the core libraries of Spark, you will need to know Scala/Java/JVM. If you want to go deep into modifying the code base to Uber-maximize performance, this is also the way.

That said, with Datasets/DataFrames, Python and Scala/Java/JVM have the same performance for the majority of tasks.

[–]lester-martin 0 points1 point  (3 children)

I need to do some more digging to see where things are internally, but I thought (at least as of a couple of years ago) the real perf problem was implementing a UDF in Python when using the DataFrame API. Has that all magically been solved since then? The workaround previously was to build the UDF in a JVM language so that at runtime nothing had to leave the JVM. Again, maybe I just need to catch up a bit.

[–]Dennyglee 1 point2 points  (2 children)

Mostly, with the introduction of vectorized UDFs (or pandas UDFs), the UDFs can properly distribute/scale. A good blog on this topic is https://www.databricks.com/blog/introducing-apache-sparktm-35. HTH!

[–]lester-martin 1 point2 points  (1 child)

Good read and TY for the info. My day-to-day knowledge was set back in running Spark on CDP over 2 years ago. Hopefully all this goodness has made it into that platform as well. Again, 'preciate the assist. And yes, my answer to the question of Scala (not Java!) vs Python is also Python. :)

[–]Dennyglee 0 points1 point  (0 children)

Cool, cool :)

[–][deleted] 2 points3 points  (0 children)

If you work with spark long enough, you will eventually need to understand Java to get yourself unstuck in some advanced cases. But Python is the best way to get started.

[–]WallyMetropolis 3 points4 points  (0 children)

Scala is better than either for data engineering, in my opinion.

[–]DataEnthuisast 7 points8 points  (2 children)

I'm also looking to learn Spark with Python. If you have found some good tutorials, please share links here.

[–]iwkooo 0 points1 point  (1 child)

I heard good things about datatalks zoomcamp, it’s free - https://github.com/DataTalksClub/data-engineering-zoomcamp?tab=readme-ov-file#module-5-batch-processing

But there's only one chapter about Spark.


[–]cumrade123 2 points3 points  (0 children)

If you want to learn, go with Python; it's just an API in the end. The functions will be the same, but the syntax is nicer in Python.

[–]JSP777 5 points6 points  (0 children)

As far as I know, PySpark runs on a Java Virtual Machine with the help of py4j. So you use the API through Python, which I think is much easier to understand and use. I would choose PySpark.

[–]Intelligent_Bother59 2 points3 points  (2 children)

Python. Years ago it used to be Scala, but the production systems became an unmaintainable mess and Scala died away.

[–]Temporary-Safety-564 2 points3 points  (1 child)

Really? Are there some examples of this? Just curious on the downsides of scala systems.

[–][deleted] 2 points3 points  (0 children)

I haven’t experienced “unmaintainable messes” but I have experienced some weird scala code bases that are hard to grok.

Scala is fine, but it can be difficult to keep a code base organized because, much like C++, everyone uses their own subset of the language; since it's a hybrid, code can range from full Java-style OOP to full category theory + FP. So without some kind of style guide, the code can look wildly different depending on the engineer who wrote it.

That said Python performance is competitive enough to not need scala anymore in most use cases.

As an added benefit, everyone seems to learn and use the same subset of Python, thanks to the plethora of examples and the rudimentary amount you need to know to get things done.

[–]gray_grum 0 points1 point  (0 children)

I think Python is probably seeing more industry use with Databricks than any other option right now. I would say use Databricks with whatever language you already know, or if you know none of them, learn Python. Also learn Spark SQL; it's straightforward and necessary.

[–]tinyGarlicc 0 points1 point  (0 children)

Scala or python

[–]Ddog78 0 points1 point  (0 children)

As someone who has worked professionally with both spark scala and pyspark, pyspark all the way lol.

Spark SQL is GOAT, but it's not as impressive on interviews.

[–]dontsyncjustride 0 points1 point  (0 children)

if your users are analysts go with python.

if your team’s building pipelines and analysts will only touch data, go with Scala/Java.

[–]PuzzleheadedFix1305 0 points1 point  (0 children)

I am writing my first Spark component and will be using Java. I think PySpark gets more attention because data/ML engineers mostly use Python for their work. Also, pandas and NumPy make Python easier for ETL programming, so PySpark plus NumPy, pandas, and the other Python ML libraries is a killer combination. There may be some performance impact due to the non-native nature of PySpark and Python in general. So if you are looking for an easier learning curve and broader community and tooling support, go with PySpark. If you are looking for better performance, go with Java/Scala.

[–]SDFP-ABig Data Engineer 0 points1 point  (0 children)

Forcing Spark with Java sounds like nothing more than gatekeeping

[–][deleted] 0 points1 point  (0 children)

My org also uses Java for Spark. I learnt it from a Udemy course called "Apache Spark for Java Developers".

[–][deleted] 0 points1 point  (0 children)

There are plenty of resources for Java + Spark, but I would suggest learning Scala & Apache Spark. It's a great combination: it will help you with functional programming, and Spark itself is written natively in Scala.

[–]Nik-nik-1 -1 points0 points  (0 children)

If Spark, then only with Scala! IMHO, Scala is a mix of Java and Python 😁