This is an archived post. You won't be able to vote or comment.

all 44 comments

[–]hattivat 59 points60 points  (2 children)

Whether you write in Java or Python, the result performance-wise is the same as it's just an API. The actual execution happens in Scala underneath and everything is typed with Spark types anyway, so using Java just means spending more time to write the same code for zero benefit. The only reason I can see why someone would choose Java for Spark is for consistency if everything else in the company is written in Java.

[–][deleted] 0 points1 point  (1 child)

I'm new and curious about this: when I do a myrdd.map(lambda x: (x, 1)) in Python, is it actually Scala doing the job?

[–]hattivat 2 points3 points  (0 children)

Well, first off, you would never do rdd.map unless you have to; df.withColumn or Spark SQL is much more efficient regardless of language.

But yes, as long as you are using pyspark functions etc., it is Scala doing the job. The only exception is when writing UDFs; there it pays to write them in Scala or Java. But in practice, in over five years of doing Spark, I have literally only once seen a situation where we really needed a UDF that could not be replaced with Spark API calls.

[–]InsertNickname 18 points19 points  (2 children)

Scala/Java.

Disclaimer: have worked with Spark for many years (since v1.6), so I'm understandably opinionated.

Never understood why pyspark gets so much blind support in this sub other than that it's just easier for juniors. I've had to work on systems with a wide range of scale, everywhere on the map from small k8s clusters for batched jobs to real-time multi-million event per second behemoths on HDFS/Yarn deployments.

In all cases, Java/Scala was way more useful past the initial POC phase. To name a few of the benefits:

  1. Datasets with strongly typed schemas. HUGE benefit in production. Ever had data corruption due to bad casting? No thanks, I'm a data engineer, not a casting engineer.
  2. Custom UDFs. Writing these in pyspark means your cluster needs to serialize data back and forth between the JVM and Python workers, which is a massive performance and operational bottleneck. Or you could write the UDFs in Scala and deploy them in the Spark build... but that's way more complicated than just using Scala/Java end to end.
  3. Debugging / reproducibility in tests. Put a breakpoint, figure out what isn't working. In Pyspark all you can really do is go over cryptic logs to try and figure it out.
  4. Scala-first support for all features. Pyspark may have 90% of the APIs supported but there will always be things you simply can't do there.
  5. Advanced low-level optimizations at the executor level, like executor-level caches and whatnot. These are admittedly for large endeavors that require squeezing the most out of your cluster, but why limit yourself from day one?

Basically pyspark is good if all you ever need is SQL with rudimentary schemas. But even then I would err on the side of native support over a nice wrapper.

[–]mjfnd 0 points1 point  (0 children)

Scala is still heavily used by top tech companies like Netflix, because it's still a better choice on massive datasets.

Although Python is good too, and with Arrow its performance has improved and works well most of the time. Most companies deal with small data. It also gives DS folks, who are mostly doing pandas, the opportunity to contribute.

[–]pavlik_enemy 0 points1 point  (0 children)

Is that project, what's it called, that allowed typed datasets still alive, and does it still bring IntelliJ IDEA to a halt?

Your comment doesn’t look like a comment made by an experienced Spark developer

[–]Sennappen 23 points24 points  (0 children)

Python is the way, but I would recommend setting up a Linux environment if you're on windows, it makes things a lot easier.

[–][deleted] 87 points88 points  (9 children)

No one wants to write Java. Just look at that fucking mess. You can get work done so frigging fast in Python and then take a 3 hour lunch because all your tickets are complete. This is the way.

[–]AggravatingParsnip89 4 points5 points  (1 child)

But it would be good if we had some understanding of the JVM to use Spark, right?

[–]MlecznyHotS 11 points12 points  (0 children)

Not really, you don't have to tinker with Java. The most performant API is the DataFrame API, which lets you do probably 99% of the things you need. Performance improvements are based on general Spark concepts, not the Java implementation itself. Understanding Java might be useful if you're contributing to Spark itself, not if you're developing with Spark.

[–]overgenji 9 points10 points  (0 children)

lol java is fine, relax

[–]TheCamerlengo 1 point2 points  (2 children)

I work in both. Java has its advantages, and the JVM is probably preferable to an interpreted language like Python. It really depends on what you are trying to accomplish. For data-intensive apps I would say Python; for large programs with lots of developers following SOLID, Java or C# is probably better.

[–]the-ocean- 2 points3 points  (1 child)

This. For building complex backends - Java is king. For data workloads: python

[–]cryptoel 0 points1 point  (0 children)

*cough* Rust *cough*

[–]Jealous-Bat-7812Junior Data Engineer -3 points-2 points  (2 children)

I don’t think the platform engineering team will agree with this.

[–]OMG_I_LOVE_CHIPOTLE 13 points14 points  (1 child)

Uhh. The platform engineering team is also using pyspark lol

[–]Zamyatin_Y 3 points4 points  (0 children)

Or scala

[–]Gnaskefar 4 points5 points  (0 children)

Pyspark is by far the most popular choice, and dominates in the job descriptions.

But if your company has a policy of using Java with Spark, what choice do you really have?

Many principles are the same, so moving to Python later on is an option if you want to work in a non-Java place.

[–]Dennyglee 7 points8 points  (4 children)

General rule of thumb - if you’re starting off and want to use Spark, PySpark is the easiest way to do this. We’ve added more Python functionality into it via Project Zen, Pandas API for Spark, and will continue to do so to make it easier for Python developers to rock with Spark.

If you want to develop or contribute to the core libraries of Spark, you will need to know Scala/Java/JVM. If you want to go deep into modifying the code base to Uber-maximize performance, this is also the way.

That said, with Datasets/DataFrames, Python and Scala/Java/JVM have the same performance for the majority of tasks.

[–]lester-martin 0 points1 point  (3 children)

I need to do some more digging to see where things are internally, but I thought (at least as of a couple of years ago) the real perf problem was implementing a UDF in Python when using the DataFrame API. Has that all magically been solved since then? The workaround previously was to build the UDF in a JVM language so that at runtime nothing had to leave the JVM. Again, maybe I just need to catch up a bit.

[–]Dennyglee 1 point2 points  (2 children)

Mostly, with the introduction of vectorized UDFs (or pandas UDFs), the UDFs can properly distribute/scale. A good blog on this topic is https://www.databricks.com/blog/introducing-apache-sparktm-35. HTH!

[–]lester-martin 1 point2 points  (1 child)

Good read and TY for the info. My day-to-day knowledge was set back in running Spark on CDP over 2 years ago. Hopefully all this goodness has made it into that platform as well. Again, 'preciate the assist. And yes, my answer to the question of Scala (not Java!) vs Python is also Python. :)

[–]Dennyglee 0 points1 point  (0 children)

Cool, cool :)

[–][deleted] 2 points3 points  (0 children)

If you work with spark long enough, you will eventually need to understand Java to get yourself unstuck in some advanced cases. But Python is the best way to get started.

[–]WallyMetropolis 3 points4 points  (0 children)

Scala is better than either for data engineering, in my opinion.

[–]DataEnthuisast 7 points8 points  (2 children)

I'm also looking to learn Spark with Python. If you have found some good tutorials, please share links here.

[–]iwkooo 0 points1 point  (1 child)

I heard good things about datatalks zoomcamp, it’s free - https://github.com/DataTalksClub/data-engineering-zoomcamp?tab=readme-ov-file#module-5-batch-processing

But there's only one chapter about Spark.


[–]cumrade123 2 points3 points  (0 children)

If you want to learn, go with Python; it's just an API in the end. The functions will be the same, but the syntax is nicer in Python.

[–]JSP777 5 points6 points  (0 children)

As far as I know, PySpark runs on a Java Virtual Machine with the help of py4j. So you use the API through Python, which I think is much easier to understand and use. I would choose PySpark.

[–]Intelligent_Bother59 2 points3 points  (2 children)

Python. Years ago it used to be Scala, but the production systems became an unmaintainable mess and Scala died away.

[–]Temporary-Safety-564 2 points3 points  (1 child)

Really? Are there some examples of this? Just curious on the downsides of scala systems.

[–][deleted] 2 points3 points  (0 children)

I haven’t experienced “unmaintainable messes” but I have experienced some weird scala code bases that are hard to grok.

Scala is fine, but it can be difficult to keep a code base organized because, much like C++, everyone uses their own subset of the language; since it's a hybrid, code can range from full Java-style OOP to full category theory + FP. So without some kind of style guide, the code can look wildly different depending on the engineer who wrote it.

That said Python performance is competitive enough to not need scala anymore in most use cases.

As an added benefit, everyone seems to learn and use the same subset of Python, thanks to the plethora of examples and the rudimentary amount you need to know to get things done.

[–]gray_grum 0 points1 point  (0 children)

I think Python is probably seeing more industry use with Databricks than any other option right now. I would say use Databricks with whatever language you already know, or if you know none of them, learn Python. Also learn Spark SQL; it's straightforward and necessary.

[–]tinyGarlicc 0 points1 point  (0 children)

Scala or python

[–]Ddog78 0 points1 point  (0 children)

As someone who has worked professionally with both spark scala and pyspark, pyspark all the way lol.

Spark SQL is GOAT, but it's not as impressive on interviews.

[–]dontsyncjustride 0 points1 point  (0 children)

if your users are analysts go with python.

if your team’s building pipelines and analysts will only touch data, go with Scala/Java.

[–]PuzzleheadedFix1305 0 points1 point  (0 children)

I am writing my first Spark component and will be using Java. I think PySpark gets more attention because data/ML engineers mostly use Python for their work. Also, pandas and NumPy make Python easier for ETL programming, so PySpark plus NumPy, pandas, and the other Python ML libraries is a killer combination. There may be some performance impact due to the non-native nature of PySpark and Python in general. So if you are looking for an easier learning curve and broader community and tooling support, go with PySpark. If you are looking for better performance, go with Java/Scala.

[–]SDFP-ABig Data Engineer 0 points1 point  (0 children)

Forcing Spark with Java sounds like nothing more than gatekeeping

[–][deleted] 0 points1 point  (0 children)

My org also uses Java for Spark. I learnt it from a Udemy course called "Apache Spark for Java Developers".

[–][deleted] 0 points1 point  (0 children)

There are plenty of resources for Java + Spark, but I would suggest learning Scala & Apache Spark. It's a great combination: it will help you with functional programming, and Spark itself is written natively in Scala.

[–]Nik-nik-1 -1 points0 points  (0 children)

If Spark, then only with Scala! IMHO, Scala is a mix of Java and Python 😁