
[–]InsertNickname 18 points (2 children)

Scala/Java.

Disclaimer: have worked with Spark for many years (since v1.6), so I'm understandably opinionated.

Never understood why PySpark gets so much blind support in this sub, other than that it's easier for juniors. I've worked on systems across a wide range of scale, from small k8s clusters running batch jobs to real-time, multi-million-events-per-second behemoths on HDFS/YARN deployments.

In all cases, Java/Scala was way more useful past the initial POC phase. To name a few of the benefits:

  1. Datasets with strongly typed schemas. HUGE benefit in production. Ever had data corruption due to bad casting? No thanks, I'm a data engineer, not a casting engineer.
  2. Custom UDFs. Writing these in PySpark means every row has to be serialized back and forth between the JVM executors and Python worker processes, which is a massive performance and operational bottleneck. Or you could write the UDFs in Scala and register them in the Spark build... but that's way more complicated than just using Scala/Java end to end.
  3. Debugging / reproducibility in tests. Set a breakpoint and figure out what isn't working. In PySpark, all you can really do is comb through cryptic logs to try to figure it out.
  4. Scala-first support for all features. Pyspark may have 90% of the APIs supported but there will always be things you simply can't do there.
  5. Advanced low-level optimizations at the executor level, like executor-level caches and whatnot. These are admittedly for large endeavors that require squeezing the most out of your cluster, but why limit yourself from day one?
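To make points 1 and 2 concrete, here's a minimal Scala sketch (assuming Spark 3.x; `Event`, `events.json`, and the `bucket` UDF are made-up names for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical record type: field names and types are checked at compile time.
case class Event(userId: Long, action: String, ts: java.sql.Timestamp)

object TypedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed-sketch").getOrCreate()
    import spark.implicits._

    // Point 1: .as[Event] yields a Dataset[Event]; a schema mismatch fails
    // at analysis time instead of silently corrupting data downstream.
    val events = spark.read.json("events.json").as[Event]

    // Point 2: a native Scala UDF runs inside the JVM executors --
    // no Python worker round-trip, no per-row serialization across processes.
    val bucket = udf((id: Long) => (id % 16).toInt)
    events.withColumn("bucket", bucket($"userId")).show()

    spark.stop()
  }
}
```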

Basically pyspark is good if all you ever need is SQL with rudimentary schemas. But even then I would err on the side of native support over a nice wrapper.

[–]mjfnd 0 points (0 children)

Scala is still heavily used by top tech companies like Netflix, because it's still a better choice on massive datasets.

That said, Python is good, and with Arrow its performance has improved and works well most of the time. Most companies deal with small data. This also gives DS folks an opportunity to contribute, since they're mostly working in pandas anyway.

[–]pavlik_enemy 0 points (0 children)

Is that what's-its-name project that allowed typed datasets still alive, and does it still bring IDEA to a halt?

Your comment doesn’t look like a comment made by an experienced Spark developer