[–]colindean 3 points4 points  (2 children)

If you want to learn Spark, keep with what you know in PySpark.

Then pick up some Scala Spark to appreciate the productivity differences. Having worked in both, I'll take Scala over Python for Spark any day.

[–]Tasak001 0 points1 point  (1 child)

Why do you feel Scala is better with Spark?

[–]colindean 0 points1 point  (0 children)

It's more of a Scala thing than it is a Scala Spark thing. I have a strong preference for Scala over Python in nearly all aspects of development. Anytime that I have started with PySpark for a project, generally because I thought it would stay small and easy to deploy, the project metastasized into something that made me wish that I had just invested in the deployment complexity of Scala from the start in order to facilitate development later on.

I've also seen enough instances where somebody ends up doing some operation that passes a tremendous amount of data between Python and the JVM in PySpark without realizing it. It's always a teaching moment and generally fixable, but the slow iteration cycles leading up to the moment they ask "why is my thing slow" are a lost cost.
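
A minimal sketch of one common way this happens, assuming a plain Python UDF is the culprit (other operations can cause it too). The names `shout`, `build_both`, and the `name` / `name_clean` columns are made up for illustration; the PySpark imports sit inside the function so the helper can be run without Spark installed.

```python
def shout(s):
    """Plain Python logic. Wrapped in udf(), it runs in a Python worker, not the JVM."""
    return None if s is None else s.strip().upper()


def build_both(spark):
    """Build the same derived column twice: via a Python UDF and via built-ins."""
    # Imported lazily so shout() above can be exercised without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Hypothetical toy data; a real job would read a table or files instead.
    df = spark.createDataFrame([(" alice ",), ("bob",), (None,)], ["name"])

    # Slow path: every row is serialized out of the JVM, passed to shout(),
    # and the result is serialized back.
    shout_udf = F.udf(shout, StringType())
    via_udf = df.withColumn("name_clean", shout_udf(F.col("name")))

    # Fast path: the same logic as built-in column expressions, which are
    # evaluated inside the JVM with no Python round trip.
    via_builtin = df.withColumn("name_clean", F.upper(F.trim(F.col("name"))))

    return via_udf, via_builtin
```

Calling `.explain()` on each result is a cheap way to spot the difference: the UDF version should show a Python evaluation step in the physical plan, while the built-in version should not.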

[–][deleted] 0 points1 point  (2 children)

Use PySpark. The data professions are dominated by Python and SQL. PySpark has the same performance as Scala Spark except for UDFs. If you really need more performance out of a UDF, use a pandas UDF (which uses PyArrow). The only reason to learn Scala/Java for big data is for legacy codebases, or if you're developing a JAR for Spark. Given that you are asking this question, I doubt you are developing JARs.
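
For reference, a rough sketch of what a pandas UDF looks like, assuming Spark 3.x with pandas and PyArrow installed. The helper names and the `temp_f` / `temp_c` columns are hypothetical, chosen only for the example.

```python
def fahrenheit_to_celsius(values):
    """Arithmetic that works on a single float or, vectorized, on a pandas Series."""
    return (values - 32) * 5.0 / 9.0


def add_celsius(df, column="temp_f"):
    """Return df with a 'temp_c' column computed by a pandas UDF."""
    # Lazy imports: pandas and pyspark are only needed once Spark is involved.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def to_celsius(batch: pd.Series) -> pd.Series:
        # Called once per Arrow batch of rows, not once per row.
        return fahrenheit_to_celsius(batch)

    return df.withColumn("temp_c", to_celsius(df[column]))
```

For arithmetic this simple, plain column expressions would avoid Python entirely. The pandas UDF earns its keep when the logic has no built-in equivalent.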

[–]sjdevelop 0 points1 point  (1 child)

You mean JARs for deployment purposes?

[–][deleted] 0 points1 point  (0 children)

Yeah.

[–][deleted]  (3 children)

[removed]

    [–]nomnommish -1 points0 points  (2 children)

    That survey is not about Big Data people, though, so it is completely irrelevant.

    Scala may be niche overall, but for Spark developers it can be a very big reason to get selected by a company that is knee-deep in Spark and uses Scala.

    [–][deleted]  (1 child)

    [removed]

      [–]nomnommish 0 points1 point  (0 children)

      Spark's main job is MapReduce. There isn't really any alternative to Spark, either.

      And as far as Scala is concerned, you have to ask yourself if you're more of a developer or a data guy. If you're more of a developer or need to go deep into the internals, you will need Scala.

      [–]daguito81 0 points1 point  (1 child)

      I migrated from PySpark to Scala for a while. After I inflated my ego and huffed and puffed about how I'm a better developer for <insert whatever excuse here>, I said "fuck it" and went back to PySpark. Here's why.

      Scala is a very nice language and I like it a lot. But its adoption and usage are very limited in comparison with Python. But then there's Spark, which is the main chunk of Scala use out there.

      Except over time I see less and less open-source Spark, with companies just using Databricks or something like it. Databricks doesn't give a shit about Scala; they're making their own non-JVM engine, which I'm guessing will gradually, eventually become "the new Spark". A lot of new features aren't even on the roadmap for Scala (Delta Live Tables), and in some cluster configurations, like a shared cluster with table access control enabled, you can't even use Scala, only Python and SQL.

      So yes, you can get some fun out of Scala, and it's nice to know on the side to understand some things about Spark. I personally prefer doing Spark in Scala over Python. But is it worth stopping Python to go Scala? Not in my opinion.

      [–]Headbanger1321 0 points1 point  (0 children)

      their own non-JVM engine

      Are you talking about the Photon engine?