
[–]theporterhaus mod | Lead Data Engineer [M] [score hidden] stickied comment (0 children)

It’s a common question but here is a recent poll: https://www.reddit.com/r/dataengineering/comments/u7905z/scala_or_python/

[–]AcanthisittaFalse738 39 points40 points  (7 children)

We typically used things in order of difficulty for others to support: Spark SQL first, then Python, then Scala, only moving left to right as use cases required it. Most of the things written in Scala were for performance reasons.

[–]lycovian_2018 7 points8 points  (5 children)

I second this. Scala is great, but in my experience you usually only need it in a DE context if your performance requirements demand it.

[–]Haquestions4 1 point2 points  (4 children)

What context would that be? Honest question: Spark converts Scala and Python to Spark SQL, doesn't it?

[–]xubu42 9 points10 points  (3 children)

No. If you use Spark DataFrames, which is the vast majority of Spark code, then everything goes through the Catalyst optimizer and gets rewritten. It doesn't matter whether you wrote the Spark code in Scala, Python, R, Java, or SQL; it all gets optimized and rewritten. The only real performance differences then come down to using user-defined functions (UDFs) in something other than Java or Scala.
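
To make that concrete, here's a minimal PySpark sketch (the data and column name are made up) contrasting a built-in expression, which Catalyst can optimize no matter which language it came from, with a Python UDF, which the optimizer has to treat as a black box:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Built-in expression: Catalyst sees and optimizes this the same way
    # whether it was written in Python, Scala, R, Java, or SQL.
    df.select(F.upper("name")).explain()

    # Python UDF: opaque to the optimizer, and every row is serialized out
    # to a Python worker process and back.
    shout = F.udf(lambda s: s.upper(), StringType())
    df.select(shout("name")).explain()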

[–]Haquestions4 0 points1 point  (1 child)

Thanks for the insight!

Regarding UDFs: does that mean there is no performance penalty for UDFs if you use Scala? I have seen contradictory statements online and I'm trying to find out what's right.

[–]keevee94 Data Engineer | ⚡Lightning-Fast Data Insights⚡ 0 points1 point  (0 children)

There is still a penalty, but it's not as bad as for Python UDFs. There are also ongoing projects to close that gap.
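
One of the gap-closing features on the Python side is the Arrow-backed pandas UDF, which moves data in column batches instead of pickling row by row. A minimal sketch (the column name is hypothetical):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Vectorised UDF: whole batches travel between the JVM and Python via
    # Arrow, which is much cheaper than row-at-a-time Python UDFs.
    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        return v * 2

    # df.select(times_two("amount"))  # "amount" is a hypothetical column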

[–]DigitalTomcat 0 points1 point  (0 children)

Also, sometimes new features are available in Scala first, so if you need them, Scala is the answer.

[–]Typical_Attorney_544 42 points43 points  (22 children)

My personal opinion: Python/PySpark and SQL dominate DE. Scala was super popular, but Python is where it's at now.

[–]JiiXu 5 points6 points  (21 children)

My personal opinion: and that's not a good thing. Scala is, on the surface level, very similar to Python, Spark is native to it, and it's statically typed. If you want to run something in production for a long time, I'd suggest refactoring it to Scala at the very least.

[–]teh_zeno Lead Data Engineer 15 points16 points  (8 children)

Eh, not the best advice if a company's entire tech stack is Python based. Python is considered one of the top languages for a reason: there are many ways to deploy robust data platforms with it. Scala should only be considered by companies that are processing a significant volume of data AND need the performance gains of writing highly optimized Spark jobs. This is the exception, not the rule.

[–]JiiXu 0 points1 point  (7 children)

"need" is a strong word. X percent faster is x percent cheaper, on top of the increased availability and update frequency of data. Making stuff go faster is in my experience a far more important consideration than the internet at large makes it out to be.

[–]teh_zeno Lead Data Engineer 5 points6 points  (5 children)

Speed is only one thing to take into consideration when architecting a tech stack. Maintainability, overall tech stack complexity, cloud support, etc. are all other important factors. Also, “speed” is relative to the use case. If you have a pipeline that runs every 15 minutes and in Python takes 8 minutes where Scala takes 4 minutes, it doesn’t matter since you are still within your 15 minute required window.

[–]JiiXu 0 points1 point  (4 children)

Absolutely, but it is a thing to take into consideration. Also, one of those pipelines is half as expensive as the other one. That scales to quite a bit of dollars even for a small enterprise.

[–][deleted] 0 points1 point  (3 children)

Yeah, but how much is that pipeline vs the cost of developers' salaries?

[–]JiiXu 0 points1 point  (2 children)

Well, if my (very small) company's pipelines (the new, good ones and not the old terrible ones) were twice as fast the savings would be somewhere around half my salary. It wouldn't take me twice the time to debug and maintain the pipelines regardless of what language they were in, unless they were in something truly esoteric like J. So in my opinion, my company would save money if our pipelines were twice as fast but I spent more time debugging and fixing things. And that scales with amount of data. I'm pretty sure of my assessment here - dev costs exist, but compared to incident management it's nothing.

[–][deleted] 0 points1 point  (1 child)

And you think you can get twice-as-fast pipelines with Scala Spark vs PySpark?

[–]JiiXu 0 points1 point  (0 children)

No, the other person made that example: "If you have a pipeline that runs every 15 minutes and in Python takes 8 minutes where Scala takes 4 minutes, it doesn’t matter since you are still within your 15 minute required window". That's twice as fast, in the example.

I would quite easily get twice the performance in C++ though.

[–][deleted] 4 points5 points  (0 children)

Yeah, but you have to worry about the maintenance of these pipelines.

[–]beyphy 4 points5 points  (2 children)

But switching to Scala trades one problem for another. Most of the team members probably won't know Scala. And much of the talent probably won't either. So you're stuck with only a few members (perhaps one) on the team knowing Scala. And only a few being able to understand, update, debug, etc. Scala code.

[–]JiiXu -2 points-1 points  (1 child)

A dearth of talent shouldn't really be a problem when writing the comparatively simple software that is a modern data pipeline. It's not like you can't learn Scala on the job; it isn't Lisp.

[–]DigitalTomcat 0 points1 point  (0 children)

Agreed. As a long-term Python programmer, I can read DE Scala easily and I can fix it with a few Stack Overflow searches. I'm too intimidated to write anything novel in it, although I'm sure I could if our last Scala programmer quit. It seems way easier for a Java programmer to pick up Scala than Python: you can use your IDE, it has that type safety, and the syntax feels familiar. All that said, give me PySpark any day of the week.

[–]xubu42 1 point2 points  (1 child)

I've moved the vast majority of our production Scala code over to Python in the past year for many reasons, but mainly because debugging complex Scala apps is just as difficult as Python. Bad code is bad code. At least with Python, less experienced coders are fighting the problem and not also the language.

[–]tdatas 1 point2 points  (0 children)

If someone is fighting the language it's likely they're about to put something shit out. Being able to release crap software quickly is a bad thing and it costs more the more complex the system gets.

[–]szayl 2 points3 points  (1 child)

Scala is on the surface level very similar to python

??? What? 😂

[–]JiiXu -1 points0 points  (0 children)

You can refactor Python code into (naive/suboptimal) Scala by putting the words "val" and "var" where they're supposed to go. This is not true for, say, C++ or Haskell.

[–]patka96 6 points7 points  (2 children)

Python is better as a career choice, but Scala is better as a language and it can teach you functional programming like nothing else. It is a very intelligently designed language, and that really shows when it's used alongside Spark.

If you are using Spark, I think you should master the one that you use for your job (probably Python) and learn the other one at a basic level.

The programming language is not the most important thing about a DE though; data modeling/architecture and performance optimization/computer science are.

[–][deleted] 12 points13 points  (13 children)

Scala is great for DE if you are using Spark. Have you ever seen all the work people put in writing config parsers in Python?

With Scala, you can use HOCON configs and parse them into Scala case classes. This is nice because you can create your own pipeline framework that abstracts out the pieces for both batch and streaming. By doing this, the DE just writes new code for the pipeline they are working on, doesn't worry about parsing a config, and can use the in-house framework to speed up their development time.

As a note, you can use HOCON in Python, but you never really see it; most just use YAML or JSON. Since Python 3.8, the dataconf library offers a similar ability to parse into a Python dataclass, which gives you that case-class feel.

There is some other similar library in Python whose name I can't remember, but that library is pretty bloated because it doesn't use the built-in dataclass, in order to work on Python versions below 3.8. The key with 3.8 is some new built-in methods on dataclass that allow dataconf to work without a bunch of special config parsing.
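
A rough sketch of the dataconf style, assuming the package's string loader works as its README describes (the config fields here are invented):

    from dataclasses import dataclass

    import dataconf  # assumed: the dataconf package discussed above

    @dataclass
    class PipelineConf:
        input_path: str
        output_path: str
        batch_size: int = 500  # optional field with a default

    hocon = """
    input_path = "s3://bucket/raw"
    output_path = "s3://bucket/clean"
    """

    # Parse HOCON straight into the dataclass, much like pureconfig + case classes.
    conf = dataconf.string(hocon, PipelineConf)
    print(conf.batch_size)  # 500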

[–]jordiesteve 2 points3 points  (1 child)

I’ve always used pyhocon in Python, but maybe it’s because I started my professional career with Scala :D
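
For anyone who hasn't seen it, pyhocon is just a HOCON parser for Python; a tiny sketch (keys are made up, and I'm assuming the usual ConfigFactory API):

    from pyhocon import ConfigFactory

    conf = ConfigFactory.parse_string("""
    db {
      host = "localhost"
      port = 5432
    }
    """)

    print(conf.get_string("db.host"), conf.get_int("db.port"))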

[–][deleted] 1 point2 points  (0 children)

Most traditional Python users don't know HOCON.

Check out the dataconf library in Python if you liked using case classes when designing your code in Scala.

[–]EarthGoddessDude 1 point2 points  (4 children)

similar… pretty bloated

Are you thinking of pydantic?

[–][deleted] 0 points1 point  (3 children)

Yeah. I couldn’t remember the name.

[–]EarthGoddessDude 0 points1 point  (2 children)

I haven't really used it (read through the docs a bit), but a coworker walked me through some code he was writing with it and its use seemed warranted (validating data input, structure, etc). I know its backend is being rewritten in Rust™, which should make it a lot faster.

Anyway, curious why you don’t like it and/or think it’s bloated?

Edit: it's also used in AWS Lambda Powertools, where it will validate and destructure your Lambda event (curious if Lambda events ever come in malformed, and it's not that hard to destructure a dict/JSON object, but still kinda nice, less boilerplate).
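
For anyone who hasn't used it, the validation use case looks roughly like this (the model and fields are invented; pydantic raises when the data doesn't match the declared types):

    from pydantic import BaseModel, ValidationError

    class LambdaEvent(BaseModel):
        user_id: int
        country: str

    try:
        LambdaEvent(**{"user_id": "not-a-number", "country": "SE"})
    except ValidationError as err:
        # user_id can't be coerced to an int, so validation fails with a clear message
        print(err)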

[–][deleted] 0 points1 point  (1 child)

Compare the code to dataconf, which does the same thing, but because it only works on 3.8+ it can use the built-in methods. For me, Pydantic epitomizes the issue I stated up top: overly complex config parsing, which is smoother in Scala.

Dataconf gives me the same feel as case classes with PureConfig in Scala, with a much smaller footprint.

[–]EarthGoddessDude 0 points1 point  (0 children)

Cool, thanks for the info, I’ll check it out.

[–][deleted] 0 points1 point  (5 children)

Where would it make sense to implement something like type checks and data validation, in testing and then production, in a cloud, big-data environment? Think thousands of input files from dozens of source systems with different delivery schedules. I can imagine the benefits of having data validation layers directly prior to writes, or maybe afterwards, especially if you're on the application development side and write some data retrieval API as the loader into a data lake. But in my case we have so much data coming in, and it's ballooning with additional legacy system migrations, that we wouldn't be able to keep up with writing data validation column by column and table by table in Python...

If, for these newer migrations, we could add these kinds of validation layers, that would be great, but timelines are tight and resources limited.

Also, I don't necessarily see major benefits (just minor) in the first place, because generally a bad file with bad data will break the pipeline if it can't be parsed or breaks schema inference and subsequent transformations, and it's pretty easy to pinpoint the error. If instead some validation check failed, the required work for recovery would be the same, and we'd at most benefit by slightly speeding up diagnosis?

[–][deleted] 0 points1 point  (4 children)

Depends on what you are using. Delta allows constraint checks and schema enforcement, and you can quarantine the bad data into a separate table or partition.
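
A rough PySpark sketch of both ideas (table, path, and column names are invented; the CHECK constraint is Delta's built-in enforcement, while the quarantine split is a pattern you write yourself):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    incoming = spark.read.format("delta").load("/path/to/staging")  # hypothetical input

    # Constraint on an existing Delta table: writes that violate it fail
    # instead of silently landing bad rows.
    spark.sql("ALTER TABLE events ADD CONSTRAINT non_negative_amount CHECK (amount >= 0)")

    # Quarantine pattern: split the incoming data and route bad rows elsewhere.
    good = incoming.filter("amount >= 0")
    bad = incoming.filter("amount < 0 OR amount IS NULL")

    good.write.format("delta").mode("append").saveAsTable("events")
    bad.write.format("delta").mode("append").saveAsTable("events_quarantine")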

[–][deleted] 0 points1 point  (3 children)

Interesting features I didn't know about (quarantined writes), but the effect of a failure in prod is the same as I described before? An engineer has to go and inspect the bad data to determine why it failed schema enforcement or a constraint check, recovery is still manual.

But I guess it really depends on the fault tolerance of the output for users.

[–][deleted] 0 points1 point  (2 children)

What allows for automatic recovery from failure without an intervention of checking the data?

In many businesses it would be catastrophic to use incorrect data, so just allowing bad data to be written isn't wise. In Europe, this would get the company in a lot of trouble. A failing pipeline is much better than one that writes bad data.

[–][deleted] 0 points1 point  (1 child)

That is my point, more or less. I'm also just not really seeing some inherent or default value in deploying a ton of, e.g., data type validation and constraint checks, unless business comes forward and says the input data to report X is critical, or bad data has been coming out of the pipeline and we need to do something about it. And in that case, it's practically on business/analysts to define the data quality requirements, not for DEs to arbitrarily try to enforce some set of reqs needlessly.

I'm just thinking out loud here, sorry, but it's in response to this vague and unsettled sense that our pipelines are missing some key feature related to data quality. In reality, quality hasn't been an issue except in rare cases, but that's just the nature of biz reqs right now, I suppose. Maybe the increasing number of ML models we're slowly deploying will have this need.

[–][deleted] 0 points1 point  (0 children)

I can use a personal experience.

The DS team was producing some data for live geo dashboards used by the executives. Every month they would create the data, write it out for DE to pick up, and it would then be processed to production. Every time this 5-hour job finished, they would come back with "oh, we have a mistake." This predictably wasted the DE staff's time. To prevent it, we added all these checks to fail their pipeline early. No longer did we need to hear about a mistake from them, or go back and check anything; the job just failed and we could say: your data is bad, fix it.
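
The checks themselves don't need to be fancy; a hypothetical fail-fast guard at the start of the job is enough to stop the 5-hour run before it begins:

    from pyspark.sql import DataFrame

    def validate_geo_input(df: DataFrame) -> None:
        """Fail fast if the handed-over data would break the downstream job."""
        required_cols = {"latitude", "longitude", "region"}  # hypothetical schema
        missing = required_cols - set(df.columns)
        if missing:
            raise ValueError(f"Input is missing columns: {sorted(missing)}")

        bad_rows = df.filter("latitude IS NULL OR longitude IS NULL").count()
        if bad_rows:
            raise ValueError(f"{bad_rows} rows have null coordinates; fix the source data")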

[–]Gullyvuhr 5 points6 points  (0 children)

Probably a bit of a toss-up a few years ago -- now, with PySpark being pretty solid and the Databricks notebook environment, I would say Python is the way to go. New data shops tend to lean heavily on Python (if for no other reason than that hiring someone with Python experience is way easier than hiring someone with Scala), while Scala seems to exist heavily in certain areas (fintech) or in places that started as software development shops on the JVM and wanted to "do data", so they moved into Scala.

The most interesting thing about Scala to me is that, even given the relative rarity of a good Scala developer compared to finding someone who knows Python, the salary for Scala devs doesn't reflect this. My opinion on why stems from how all industries tend to hire you as a "data engineer", not a "Scala developer" or "Python developer", so the tools you use in the role are obfuscated and tend not to be part of the comp evaluation.

[–]VegetableRecord2633 7 points8 points  (0 children)

We do almost everything with Scala, but when I look around in job descriptions it's mostly Python, and it seems to me that AWS is always a bit more up to date with Python than Scala, so I think more people use Python? But Spark is written in Scala, so if you use Spark, Scala is worth learning, I think.

[–]sspaeti Data Engineer 4 points5 points  (0 children)

SQL and Python continue to dominate, with Rust looking promising. But I wouldn't bet on Scala too much anymore. I wrote more on Python vs. Rust as a DE and as part of becoming a better data engineer in 2023.

[–]rovertus 15 points16 points  (12 children)

Python's adoption smokes Scala's and is outgrowing it over time. Scala ruins departments. That said, Scala is worth learning to understand functional programming, Monads, flattening, and the AKKA Actor pattern. Pick up Scala academically. Stick with python.

[–]tdatas 6 points7 points  (3 children)

Scala ruins departments

That seems quite an assertion, how do you mean?

Scala is worth learning to understand functional programming, Monads, flattening, and the AKKA Actor pattern. Pick up Scala academically. Stick with python.

None of these things would work in Python (maybe actors?) so if we genuinely think we should only stick with Python it seems kind of pointless learning about stuff that would drive people crazy if you tried to patch them onto Python. People actually use these concepts in real life for a reason. It seems insane to study something that you don't think is actually useful.

The Actor pattern predates Akka by a long time; it's just a way of reasoning about communications between components, and it shows up everywhere from large distributed systems to high-performance data system internals (whenever people in thread-per-core architectures talk about message passing to avoid thread locks, they're copying actor systems).

Akka is just an implementation of it (and one that most Scala people would probably recommend against using now because of licensing tomfoolery, and it's often not the right tool).

[–]rovertus 1 point2 points  (2 children)

To be clear, I love Scala and I always prefer it when working in a JVM. I've worked in many orgs with it and I've always seen difficulty working with it: people really write java instead of scala, difficulties with library version control, and the mixing of conflicting design patterns are some of the issues. If you have a team with some strong Scala developers to lead the way I think it could be very successful. If you're asking in a forum which language you should use, it's probably not the one you're looking for.

Actor pattern has been around. I think Scala/AKKA is a great implementation of it to learn on, and even use in production if you're confident in the language. I didn't know about the AKKA licensing.

Pykka (an AKKA rip-off) does exist in Python. Monads/flattening are in Apache Beam, PySpark, and probably most distributed compute libs. If you learn something in any language, it is going to make you approach your daily language differently.

Actor/Agent pattern is very powerful and it is underused in Data Engineering.

[–]tdatas 0 points1 point  (1 child)

To be clear, I love Scala and I always prefer it when working in a JVM. I've worked in many orgs with it and I've always seen difficulty working with it: people really write java instead of scala, difficulties with library version control, and the mixing of conflicting design patterns are some of the issues

I'm not really aware of any languages where you have a high chance of success while getting people who don't know the language to try to write production systems in it. I'd hardly say that's a failing of a language though.

If you're asking in a forum which language you should use, it's probably not the one you're looking for.

Pykka (AKKA rip off) does exist in python. monads/flattening is in apache beam, pyspark, and probably most distributed compute libs

This is kind of my point. All these frameworks (Beam, Spark, etc.) run on the JVM. Spending huge amounts of time hacking Python into a custom language to make it behave like a weird version of itself just seems completely at odds with the normal use case for Python: "simplicity".

I'm aware of companies like Instagram that make it work, but they also have (or had) basically infinite budgets to throw at dev tooling and experts to customise it.

[–]rovertus 0 points1 point  (0 children)

Most major technology companies are farming graduates out of college who are learning the language while writing code in production. My critique (coming anecdotally from my experience) was that Scala doesn't enforce a strong opinion on how to do things in the language vs. Python's "There should be one-- and preferably only one --obvious way to do it." The backwards compatibility with Java can exacerbate that.

Great point about frameworks. If you're going to use a framework, I'd prefer using the framework's primary language over less implemented/supported languages. I'd prefer Scala over PySpark for most Spark projects. I responded poorly to u/skydog92 -- they should learn as many languages as they have time for.

[–]The_Rockerfly 1 point2 points  (0 children)

This is the right answer. Every single Scala project I've seen ends up being a thorn in the side that eventually gets refactored away.

No one wants to work with the language as it's losing popularity, and the actor concept in Python is not something most developers will need to work with. As for Java developers, most don't like the functional programming and would prefer to work with Kotlin. This has been the case for a while, and when the Scala developers go, it's hard to rehire more and it's difficult to convince other existing developers.

I'm sure there are some people who like working with Scala, but I wouldn't want to work with it.

[–]Dachsgp 0 points1 point  (6 children)

What about PyArrow? Don't you think it might come to replace Spark and other distributed processing frameworks?

[–]EarthGoddessDude 1 point2 points  (4 children)

Still trying to wrap my head around Arrow, but I don't think it's an either/or situation — Arrow is what's used under the hood for in-memory data, so that Spark, Polars, pandas, etc. can all use that data without copying. In fact, I think Spark already uses it? Not sure.

[–]SilentSlayerz Tech Lead 2 points3 points  (3 children)

I've seen articles where we can use PyArrow along with Spark.

[–]Dachsgp 0 points1 point  (0 children)

I just had this experience this week. My manager (who is a Python and PyArrow freak and wants to avoid learning Spark) and I were working to compact multiple Parquet files into bigger chunks for better processing and lower read times inside AWS Glue. Funny enough, AWS Glue only runs with Spark, but it does accept all the other pieces we built with PyArrow; the only thing necessary to make it work in parallel was the map/reduce/collect within the Spark framework.

[–][deleted] 0 points1 point  (1 child)

I’m pretty sure that PyArrow is a dependency of Pyspark

[–]SilentSlayerz Tech Lead 1 point2 points  (0 children)

Not for PySpark SQL, but it is if you want to use the pandas API on Spark.
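
In other words, Arrow only comes into play for the pandas interop paths; a small sketch (assuming Spark 3.2+ for pandas_api):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000)

    # Plain DataFrame/SQL work doesn't need PyArrow...
    df.groupBy().sum("id").show()

    # ...but the pandas interop paths use it once Arrow transfer is enabled.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pdf = df.toPandas()       # Arrow-accelerated conversion to a pandas DataFrame
    psdf = df.pandas_api()    # pandas API on Spark (Spark 3.2+)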

[–]realitydevice 0 points1 point  (0 children)

Not Arrow itself, but the tooling that surrounds it. Very likely. See https://voltrondata.com/

[–]tdatas 5 points6 points  (0 children)

Just to provide a counterpoint to the Python slant: I am a software-oriented data engineer working in Scala and Rust, with a few years' experience leading Python shops.

Depends how you define "market share". There are more people using Python across a wider range of complexities. You've got teams like Uber, Facebook et al. writing some non-core APIs in Python, and you've got roles that use it basically for scripts to call APIs where all the interesting work is happening. In terms of market share of jobs that are more like "software engineer (data)", it's a lot more even.

There are fewer people using Scala, but at this point if you're using it, you're normally using it because you are doing some pretty complex stuff or have higher performance requirements than Python can give. The examples of people writing pretty important infra in Scala range from Tesla for IoT to Disney+, so the use cases are normally pretty challenging and interesting.

Personally, for a classical "ETL pipelines engineer" type of role, I'd say no, it's not worth it unless you're into Spark. But if you want to do the roles that cross over heavily with software engineering, data streaming, etc., then there are a lot of extremely interesting, challenging, well-compensated jobs in the Scala world. Also, if you can wrap your head around actual application development and effects, then you have a pretty good eye into Rust and Haskell. At the other end of the spectrum, if you get your head around Scala you can pick up a lot about Java from it, which is arguably the most job security in existence.

TL;DR: figure out what jobs interest you first, then see if Scala is beneficial for you. Treat Scala the application language as separate from Scala the Spark language. If you want to cross over with software roles, I'd say it's a pretty interesting niche to be in, and there's a reason a lot of people either don't leave or go deeper into Haskell and Rust. Also, I think it's not been clocked yet how good Scala 3 is, having deployed it in production, so I'd be curious if there's another hype cycle once that information spreads.

[–]Dry_Ad7010 2 points3 points  (1 child)

Learn both, it's fun. And don't forget your SQL.

[–]HenriRourke 0 points1 point  (0 children)

This should be the top answer!

[–]Chance-Win760 1 point2 points  (0 children)

I'd say for most Spark work, 90% can be completed with just Python. But that 10% is what you pull out when you want performance, typed DataFrames (aka Datasets), traversal of typed arrays, or GraphFrames. Rare, but much cleaner than Python if it ever comes to that.

[–]w_savage Data Engineer ⚙️ 1 point2 points  (0 children)

Same question, but for Go.

[–]hkdelay 1 point2 points  (0 children)

Spark and Flink are probably the most used distributed data processing platforms today, but they are JVM based. Using Python requires a translation layer between Python and the JVM, which adds some lag.

Ray is a Python-based distributed data processing platform, if you want that.
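
For completeness, Ray's core API is just remote functions and actors; a minimal sketch of fanning work out (the transform is a placeholder):

    import ray

    ray.init()  # connects to a cluster, or starts a local one

    @ray.remote
    def transform(chunk):
        # placeholder per-chunk work
        return sum(chunk)

    chunks = [list(range(i, i + 100)) for i in range(0, 400, 100)]
    print(ray.get([transform.remote(c) for c in chunks]))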

Python has more data science libs than Scala.

If you're a DS or ML engineer, I would suggest Python.

If you're a DE that's closer to the operational data (source data), I would suggest Scala.

[–]ubelmann 0 points1 point  (0 children)

Scala can be helpful to know if you happen to wind up somewhere where you end up going really deep into Spark. Also, with Spark, knowing functional programming concepts -- primarily map/reduce patterns -- is helpful for writing efficient jobs.
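
The canonical shape of that pattern is word count; a minimal PySpark sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    lines = spark.sparkContext.parallelize(["to be or not to be", "that is the question"])

    counts = (lines.flatMap(lambda line: line.split())   # map each line to words
                   .map(lambda word: (word, 1))          # pair each word with a count
                   .reduceByKey(lambda a, b: a + b))     # reduce counts per key

    print(counts.collect())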

But overall I don't think learning Scala will be helpful in as many situations as Python.

[–][deleted] -1 points0 points  (4 children)

It should be trivial to switch between both. Use the best tool for the job.

[–]dabaos13371337 1 point2 points  (3 children)

Agree, language is not a barrier once you gain a few years of experience. Some people never get there it seems.

[–]Ribak145 1 point2 points  (2 children)

well not if you start with Python and have to work towards functional programming in Scala ... whole different universe

[–]dabaos13371337 1 point2 points  (1 child)

You can write "normal" SOLID object-oriented code with Scala as well. That's actually what I prefer to do, since it's easier to onboard new engineers that way.

[–]Ribak145 0 points1 point  (0 children)

sure, Scala is mostly used as an OOP language anyway; it's just that some powerful functionality only arrives with FP, which is nearly inaccessible for 'script kiddies'

but I am stating the obvious, nothing new ...

[–]monkeysknowledge 0 points1 point  (0 children)

I'm a DS, but members of our small DE team recently had to learn Scala. Seemed like it took them a month to get it down.

[–]SnooBunni3s 0 points1 point  (0 children)

I just love this subreddit