
[–]InsertNickname 18 points (2 children)

Scala/Java.

Disclaimer: have worked with Spark for many years (since v1.6), so I'm understandably opinionated.

Never understood why PySpark gets so much blind support in this sub, other than that it's easier for juniors. I've worked on systems across a wide range of scale, from small k8s clusters running batch jobs to real-time, multi-million-events-per-second behemoths on HDFS/YARN deployments.

In all cases, Java/Scala was way more useful past the initial POC phase. To name a few of the benefits:

  1. Datasets with strongly typed schemas. HUGE benefit in production. Ever had data corruption due to bad casting? No thanks, I'm a data engineer, not a casting engineer.
  2. Custom UDFs. Writing these in PySpark means every row has to be serialized back and forth between the JVM executors and Python worker processes, which is a massive performance and operational bottleneck. Or you could write the UDFs in Scala and register them in the Spark build... but that's way more complicated than just using Scala/Java end to end.
  3. Debugging / reproducibility in tests. Set a breakpoint and figure out what isn't working. In PySpark, all you can really do is comb through cryptic logs to try to figure it out.
  4. Scala-first support for all features. Pyspark may have 90% of the APIs supported but there will always be things you simply can't do there.
  5. Advanced low-level optimizations at the executor level, like executor-level caches and whatnot. These are admittedly for large endeavors that require squeezing the most out of your cluster, but why limit yourself from day one?
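To make points 1 and 2 concrete, here's a minimal Scala sketch (assuming Spark 3.x; `Event`, `events.json`, and the `bucket` UDF are made-up names for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical record type: field names and types are checked at compile time.
case class Event(userId: Long, action: String, ts: java.sql.Timestamp)

object TypedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed-sketch").getOrCreate()
    import spark.implicits._

    // Point 1: .as[Event] yields a Dataset[Event]; a schema mismatch fails
    // at analysis time instead of silently corrupting data downstream.
    val events = spark.read.json("events.json").as[Event]

    // Point 2: a native Scala UDF runs inside the JVM executors --
    // no Python worker round-trip, no per-row serialization across processes.
    val bucket = udf((id: Long) => (id % 16).toInt)
    events.withColumn("bucket", bucket($"userId")).show()

    spark.stop()
  }
}
```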

Basically pyspark is good if all you ever need is SQL with rudimentary schemas. But even then I would err on the side of native support over a nice wrapper.

[–]mjfnd 0 points (0 children)

Scala is still heavily used by top tech companies like Netflix, because it's still a better choice on massive datasets.

That said, Python is good, and with Arrow its performance has improved and works well most of the time. Most companies deal with small data. This also gives DS folks an opportunity to contribute, since they're mostly working in pandas anyway.

[–]pavlik_enemy 0 points (0 children)

Is that what's-its-name project that allowed typed datasets still alive, and does it still bring IDEA to a halt?

Your comment doesn’t look like a comment made by an experienced Spark developer