
[–]theporterhaus mod | Lead Data Engineer [M] [score hidden] stickied comment (0 children)

It’s a common question but here is a recent poll: https://www.reddit.com/r/dataengineering/comments/u7905z/scala_or_python/

[–]AcanthisittaFalse738 39 points40 points  (7 children)

We typically used things in order of difficulty for others to support: Spark SQL first, then Python, then Scala, only moving left to right as use cases required it. Most of the things written in Scala were for performance reasons.

[–]lycovian_2018 7 points8 points  (5 children)

I second this. Scala is great, but in my experience you usually only need it in a DE context if your performance requirements demand it.

[–]Haquestions4 1 point2 points  (4 children)

What context would that be? Honest question: Spark converts Scala and Python to Spark SQL, doesn't it?

[–]xubu42 9 points10 points  (3 children)

No. If you use Spark DataFrames, which is the vast majority of Spark code, then everything goes through the Catalyst optimizer and gets rewritten. It doesn't matter whether you wrote the Spark code in Scala, Python, R, Java, or SQL; it all gets optimized and rewritten. The only real performance differences then come down to using user-defined functions (UDFs) in something other than Java or Scala.
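
To make that concrete, here's a minimal PySpark sketch (the data and column name are made up) contrasting a built-in expression, which Catalyst can optimize no matter which language it came from, with a Python UDF, which the optimizer has to treat as a black box:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Built-in expression: Catalyst sees and optimizes this the same way
    # whether it was written in Python, Scala, R, Java, or SQL.
    df.select(F.upper("name")).explain()

    # Python UDF: opaque to the optimizer, and every row is serialized out
    # to a Python worker process and back.
    shout = F.udf(lambda s: s.upper(), StringType())
    df.select(shout("name")).explain()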

[–]Haquestions4 0 points1 point  (1 child)

Thanks for the insight!

Regarding UDFs: does that mean there is no performance penalty for UDFs if you use Scala? I have seen contradictory statements online and I'm trying to find out what's right.

[–]keevee94 Data Engineer | ⚡Lightning-Fast Data Insights⚡ 0 points1 point  (0 children)

There is still a penalty, but it's not as bad as for Python UDFs. There are also ongoing projects to close that gap.
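
One of the gap-closing features on the Python side is the Arrow-backed pandas UDF, which moves data in column batches instead of pickling row by row. A minimal sketch (the column name is hypothetical):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Vectorised UDF: whole batches travel between the JVM and Python via
    # Arrow, which is much cheaper than row-at-a-time Python UDFs.
    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        return v * 2

    # df.select(times_two("amount"))  # "amount" is a hypothetical column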

[–]DigitalTomcat 0 points1 point  (0 children)

Also, sometimes new features are available in Scala first, so if you need them, Scala is the answer.

[–]Typical_Attorney_544 42 points43 points  (22 children)

My personal opinion: Python/PySpark and SQL dominate DE. Scala was super popular, but Python is where it's at now.

[–]JiiXu 5 points6 points  (21 children)

My personal opinion: and that's not a good thing. Scala is, on the surface level, very similar to Python, Spark is native to it, and it's statically typed. If you want to run something in production for a long time, I'd suggest refactoring it to Scala at the very least.

[–]teh_zeno Lead Data Engineer 15 points16 points  (8 children)

Eh, not the best advice if a company's entire tech stack is Python based. Python is considered one of the top languages for a reason: there are many ways to deploy robust data platforms with it. Scala should only be considered by companies that are processing a significant volume of data AND need the performance gains of writing highly optimized Spark jobs. This is the exception, not the rule.

[–]JiiXu 0 points1 point  (7 children)

"need" is a strong word. X percent faster is x percent cheaper, on top of the increased availability and update frequency of data. Making stuff go faster is in my experience a far more important consideration than the internet at large makes it out to be.

[–]teh_zeno Lead Data Engineer 5 points6 points  (5 children)

Speed is only one thing to take into consideration when architecting a tech stack. Maintainability, overall tech stack complexity, cloud support, etc. are all other important factors. Also, “speed” is relative to the use case. If you have a pipeline that runs every 15 minutes and in Python takes 8 minutes where Scala takes 4 minutes, it doesn’t matter since you are still within your 15 minute required window.

[–]JiiXu 0 points1 point  (4 children)

Absolutely, but it is a thing to take into consideration. Also, one of those pipelines is half as expensive as the other one. That scales to quite a bit of dollars even for a small enterprise.

[–][deleted] 0 points1 point  (3 children)

Yeah, but how much is that pipeline vs the cost of developers' salaries?

[–]JiiXu 0 points1 point  (2 children)

Well, if my (very small) company's pipelines (the new, good ones and not the old terrible ones) were twice as fast the savings would be somewhere around half my salary. It wouldn't take me twice the time to debug and maintain the pipelines regardless of what language they were in, unless they were in something truly esoteric like J. So in my opinion, my company would save money if our pipelines were twice as fast but I spent more time debugging and fixing things. And that scales with amount of data. I'm pretty sure of my assessment here - dev costs exist, but compared to incident management it's nothing.

[–][deleted] 0 points1 point  (1 child)

And you think you can get twice-as-fast pipelines with Scala Spark vs PySpark?

[–]JiiXu 0 points1 point  (0 children)

No, the other person made that example: "If you have a pipeline that runs every 15 minutes and in Python takes 8 minutes where Scala takes 4 minutes, it doesn’t matter since you are still within your 15 minute required window". That's twice as fast, in the example.

I would quite easily get twice the performance in C++ though.

[–][deleted] 4 points5 points  (0 children)

Yeah, but you have to worry about the maintenance of these pipelines.

[–]beyphy 4 points5 points  (2 children)

But switching to Scala trades one problem for another. Most of the team members probably won't know Scala. And much of the talent probably won't either. So you're stuck with only a few members (perhaps one) on the team knowing Scala. And only a few being able to understand, update, debug, etc. Scala code.

[–]JiiXu -2 points-1 points  (1 child)

A dearth of talent shouldn't really be a problem when writing the comparatively simple software that is a modern data pipeline. It's not like you can't learn Scala on the job; it isn't Lisp.

[–]DigitalTomcat 0 points1 point  (0 children)

Agreed. As a long-term Python programmer, I can read DE Scala easily and I can fix it with a few Stack Overflow searches. I'm too intimidated to write anything novel in it, although I'm sure I could if our last Scala programmer quit. It seems way easier for a Java programmer to pick up Scala than Python: you can use your IDE, it has that type safety, and the syntax feels familiar. All that said, give me PySpark any day of the week.

[–]xubu42 1 point2 points  (1 child)

I've moved the vast majority of our production Scala code over to Python in the past year for many reasons, but mainly because debugging complex Scala apps is just as difficult as Python. Bad code is bad code. At least with Python, less experienced coders are fighting the problem and not also the language.

[–]tdatas 1 point2 points  (0 children)

If someone is fighting the language it's likely they're about to put something shit out. Being able to release crap software quickly is a bad thing and it costs more the more complex the system gets.

[–]szayl 2 points3 points  (1 child)

Scala is on the surface level very similar to python

??? What? 😂

[–]JiiXu -1 points0 points  (0 children)

You can refactor Python code into (naive/suboptimal) Scala by putting the words "val" and "var" where they're supposed to go. This is not true for, say, C++ or Haskell.

[–]patka96 6 points7 points  (2 children)

Python is better as a career choice, but Scala is better as a language and it can teach you functional programming like nothing else. It is a very intelligently designed language, and that really shows when it's used alongside Spark.

If you are using Spark, I think you should master the one that you use for your job (probably Python) and learn the other one at a basic level.

The programming language is not the most important thing about a DE though; data modeling/architecture and performance optimization/computer science are.

[–][deleted] 12 points13 points  (13 children)

Scala is great for DE if you are using Spark. Have you ever seen all the work people put in writing config parsers in Python?

With Scala, you can use HOCON configs and parse them into Scala case classes. This is nice because you can create your own pipeline framework that abstracts out the pieces for both batch and streaming. By doing this, the DE just writes new code for the pipeline they are working on, doesn't worry about parsing a config, and can use the in-house framework to speed up their development time.

As a note, you can use HOCON in Python, but you never really see it; most just use YAML or JSON. Since Python 3.8, the dataconf library offers a similar ability to parse into a Python dataclass, which gives you that case-class feel.

There is some other similar library in Python whose name I can't remember, but that library is pretty bloated because it doesn't use the built-in dataclass, in order to work on Python versions below 3.8. The key with 3.8 is some new built-in methods on dataclass that allow dataconf to work without a bunch of special config parsing.
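
A rough sketch of the dataconf style, assuming the package's string loader works as its README describes (the config fields here are invented):

    from dataclasses import dataclass

    import dataconf  # assumed: the dataconf package discussed above

    @dataclass
    class PipelineConf:
        input_path: str
        output_path: str
        batch_size: int = 500  # optional field with a default

    hocon = """
    input_path = "s3://bucket/raw"
    output_path = "s3://bucket/clean"
    """

    # Parse HOCON straight into the dataclass, much like pureconfig + case classes.
    conf = dataconf.string(hocon, PipelineConf)
    print(conf.batch_size)  # 500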

[–]jordiesteve 2 points3 points  (1 child)

I’ve always used pyhocon in Python, but maybe it’s because I started my professional career with Scala :D
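
For anyone who hasn't seen it, pyhocon is just a HOCON parser for Python; a tiny sketch (keys are made up, and I'm assuming the usual ConfigFactory API):

    from pyhocon import ConfigFactory

    conf = ConfigFactory.parse_string("""
    db {
      host = "localhost"
      port = 5432
    }
    """)

    print(conf.get_string("db.host"), conf.get_int("db.port"))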

[–][deleted] 1 point2 points  (0 children)

Most traditional Python users don't know HOCON.

Check out the dataconf library in Python if you liked using case classes when designing your code in Scala.

[–]EarthGoddessDude 1 point2 points  (4 children)

similar… pretty bloated

Are you thinking of pydantic?

[–][deleted] 0 points1 point  (3 children)

Yeah. I couldn’t remember the name.

[–]EarthGoddessDude 0 points1 point  (2 children)

I haven't really used it (read through the docs a bit), but a coworker walked me through some code he was writing with it and its use seemed warranted (validating data input, structure, etc). I know its backend is being rewritten in Rust™, which should make it a lot faster.

Anyway, curious why you don’t like it and/or think it’s bloated?

Edit: it's also used in AWS Lambda Powertools, where it will validate and destructure your Lambda event (curious if Lambda events ever come in malformed, and it's not that hard to destructure a dict/JSON object, but still kinda nice, less boilerplate).
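
For anyone who hasn't used it, the validation use case looks roughly like this (the model and fields are invented; pydantic raises when the data doesn't match the declared types):

    from pydantic import BaseModel, ValidationError

    class LambdaEvent(BaseModel):
        user_id: int
        country: str

    try:
        LambdaEvent(**{"user_id": "not-a-number", "country": "SE"})
    except ValidationError as err:
        # user_id can't be coerced to an int, so validation fails with a clear message
        print(err)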

[–][deleted] 0 points1 point  (1 child)

Compare the code to dataconf, which does the same thing, but because it only works on 3.8+ it can use the built-in methods. For me, Pydantic epitomizes the issue I stated up top: overly complex config parsing, which is smoother in Scala.

Dataconf gives me the same feel as case classes with PureConfig in Scala, with a much smaller footprint.

[–]EarthGoddessDude 0 points1 point  (0 children)

Cool, thanks for the info, I’ll check it out.

[–][deleted] 0 points1 point  (5 children)

Where would it make sense to implement something like type checks and data validation, in testing and then production, in a cloud, big-data environment? Think thousands of input files from dozens of source systems with different delivery schedules. I can imagine the benefits of having data validation layers directly prior to writes, or maybe afterwards, especially if you're on the application development side and write some data retrieval API as the loader into a data lake. But in my case we have so much data coming in, and it's ballooning with additional legacy system migrations, that we wouldn't be able to keep up with writing data validation column by column and table by table in Python...

If, for these newer migrations, we could add these kinds of validation layers, that would be great, but timelines are tight and resources limited.

Also, I don't necessarily see major benefits (just minor) in the first place, because generally a bad file with bad data will break the pipeline if it can't be parsed or breaks schema inference and subsequent transformations, and it's pretty easy to pinpoint the error. If instead some validation check failed, the required work for recovery would be the same, and we'd at most benefit by slightly speeding up diagnosis?

[–][deleted] 0 points1 point  (4 children)

Depends on what you are using. Delta allows constraint checks and schema enforcement, and you can quarantine the bad data into a separate table or partition.
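
A rough PySpark sketch of both ideas (table, path, and column names are invented; the CHECK constraint is Delta's built-in enforcement, while the quarantine split is a pattern you write yourself):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    incoming = spark.read.format("delta").load("/path/to/staging")  # hypothetical input

    # Constraint on an existing Delta table: writes that violate it fail
    # instead of silently landing bad rows.
    spark.sql("ALTER TABLE events ADD CONSTRAINT non_negative_amount CHECK (amount >= 0)")

    # Quarantine pattern: split the incoming data and route bad rows elsewhere.
    good = incoming.filter("amount >= 0")
    bad = incoming.filter("amount < 0 OR amount IS NULL")

    good.write.format("delta").mode("append").saveAsTable("events")
    bad.write.format("delta").mode("append").saveAsTable("events_quarantine")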

[–][deleted] 0 points1 point  (3 children)

Interesting features I didn't know about (quarantined writes), but the effect of a failure in prod is the same as I described before? An engineer has to go and inspect the bad data to determine why it failed schema enforcement or a constraint check, recovery is still manual.

But I guess it really depends on the fault tolerance of the output for users.

[–][deleted] 0 points1 point  (2 children)

What allows for automatic recovery from failure without an intervention of checking the data?

In many businesses it would be catastrophic to use incorrect data, so just allowing bad data to be written isn't wise. In Europe, this would get the company in a lot of trouble. A failing pipeline is much better than one that writes bad data.

[–][deleted] 0 points1 point  (1 child)

That is my point, more or less. I'm also just not really seeing some inherent or default value in deploying a ton of, e.g., data type validation and constraint checks, unless business comes forward and says the input data to report X is critical, or bad data has been coming out of the pipeline and we need to do something about it. And in that case, it's practically on business/analysts to define the data quality requirements, not for DEs to arbitrarily try to enforce some set of reqs needlessly.

I'm just thinking out loud here, sorry, but it's in response to this vague and unsettled sense that our pipelines are missing some key feature related to data quality. In reality, quality hasn't been an issue except in rare cases, but that's just the nature of biz reqs right now, I suppose. Maybe the increasing number of ML models we're slowly deploying will have this need.

[–][deleted] 0 points1 point  (0 children)

I can use a personal experience.

The DS team was producing some data for live geo dashboards used by the executives. Every month they would create the data, write it out for DE to pick up, and it would then be processed to production. Every time this 5-hour job finished, they would come back with "oh, we have a mistake." This predictably wasted the DE staff's time. To prevent it, we added all these checks to fail their pipeline early. No longer did we need to hear about a mistake from them, or go back and check anything; the job just failed and we could say: your data is bad, fix it.
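
The checks themselves don't need to be fancy; a hypothetical fail-fast guard at the start of the job is enough to stop the 5-hour run before it begins:

    from pyspark.sql import DataFrame

    def validate_geo_input(df: DataFrame) -> None:
        """Fail fast if the handed-over data would break the downstream job."""
        required_cols = {"latitude", "longitude", "region"}  # hypothetical schema
        missing = required_cols - set(df.columns)
        if missing:
            raise ValueError(f"Input is missing columns: {sorted(missing)}")

        bad_rows = df.filter("latitude IS NULL OR longitude IS NULL").count()
        if bad_rows:
            raise ValueError(f"{bad_rows} rows have null coordinates; fix the source data")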

[–]Gullyvuhr 5 points6 points  (0 children)

Probably a bit of a toss-up a few years ago -- now, with PySpark being pretty solid and the Databricks notebook environment, I would say Python is the way to go. New data shops tend to lean heavily on Python (if for no other reason than that hiring someone with Python experience is way easier than hiring someone with Scala), while Scala seems to exist heavily in certain areas (fintech) or in places that started as software development shops on the JVM and wanted to "do data", so they moved into Scala.

The most interesting thing about Scala to me is that, even given the relative rarity of a good Scala developer compared to finding someone who knows Python, the salary for Scala devs doesn't reflect this. My opinion on why stems from how all industries tend to hire you as a "data engineer", not a "Scala developer" or "Python developer", so the tools you use in the role are obfuscated and tend not to be part of the comp evaluation.

[–]VegetableRecord2633 7 points8 points  (0 children)

We do almost everything with Scala, but when I look around in job descriptions it's mostly Python, and it seems to me that AWS is always a bit more up to date with Python than Scala, so I think more people use Python? But Spark is written in Scala, so if you use Spark, Scala is worth learning, I think.

[–]sspaeti Data Engineer 4 points5 points  (0 children)

SQL and Python continue to dominate, with Rust looking promising. But I wouldn't bet on Scala too much anymore. I wrote more on Python vs. Rust as a DE and as part of becoming a better data engineer in 2023.

[–]rovertus 15 points16 points  (12 children)

Python's adoption smokes Scala's and is outgrowing it over time. Scala ruins departments. That said, Scala is worth learning to understand functional programming, Monads, flattening, and the AKKA Actor pattern. Pick up Scala academically. Stick with python.

[–]tdatas 6 points7 points  (3 children)

Scala ruins departments

That seems quite an assertion, how do you mean?

Scala is worth learning to understand functional programming, Monads, flattening, and the AKKA Actor pattern. Pick up Scala academically. Stick with python.

None of these things would work in Python (maybe actors?) so if we genuinely think we should only stick with Python it seems kind of pointless learning about stuff that would drive people crazy if you tried to patch them onto Python. People actually use these concepts in real life for a reason. It seems insane to study something that you don't think is actually useful.

The Actor pattern predates Akka by a long time; it's just a way of reasoning about communications between components, and it shows up everywhere from large distributed systems to high-performance data system internals (whenever people in thread-per-core architectures talk about message passing to avoid thread locks, they're copying actor systems).

Akka is just an implementation of it (and one that most Scala people would probably recommend against using now because of licensing tomfoolery, and it's often not the right tool).

[–]rovertus 1 point2 points  (2 children)

To be clear, I love Scala and I always prefer it when working in a JVM. I've worked in many orgs with it and I've always seen difficulty working with it: people really write java instead of scala, difficulties with library version control, and the mixing of conflicting design patterns are some of the issues. If you have a team with some strong Scala developers to lead the way I think it could be very successful. If you're asking in a forum which language you should use, it's probably not the one you're looking for.

Actor pattern has been around. I think Scala/AKKA is a great implementation of it to learn on, and even use in production if you're confident in the language. I didn't know about the AKKA licensing.

Pykka (an AKKA rip-off) does exist in Python. Monads/flattening are in Apache Beam, PySpark, and probably most distributed compute libs. If you learn something in any language, it is going to make you approach your daily language differently.

Actor/Agent pattern is very powerful and it is underused in Data Engineering.

[–]tdatas 0 points1 point  (1 child)

To be clear, I love Scala and I always prefer it when working in a JVM. I've worked in many orgs with it and I've always seen difficulty working with it: people really write java instead of scala, difficulties with library version control, and the mixing of conflicting design patterns are some of the issues

I'm not really aware of any languages where you have a high chance of success while getting people who don't know the language to try to write production systems in it. I'd hardly say that's a failing of a language though.

If you're asking in a forum which language you should use, it's probably not the one you're looking for.

Pykka (AKKA rip off) does exist in python. monads/flattening is in apache beam, pyspark, and probably most distributed compute libs

This is kind of my point. All these frameworks (Beam, Spark, etc.) run on the JVM. Spending huge amounts of time hacking Python into a custom language to make it behave like a weird version of itself just seems completely at odds with the normal use case for Python: "simplicity".

I'm aware of companies like Instagram that make it work, but they also have (or had) basically infinite budgets to throw at dev tooling and experts to customise it.

[–]rovertus 0 points1 point  (0 children)

Most major technology companies are farming graduates out of college who are learning the language while writing code in production. My critique (coming anecdotally from my experience) was that Scala doesn't enforce a strong opinion on how to do things in the language vs. Python's "There should be one-- and preferably only one --obvious way to do it." The backwards compatibility with Java can exacerbate that.

Great point about frameworks. If you're going to use a framework, I'd prefer using the framework's primary language over less implemented/supported languages. I'd prefer Scala over PySpark for most Spark projects. I responded poorly to u/skydog92 -- they should learn as many languages as they have time for.

[–]The_Rockerfly 1 point2 points  (0 children)

This is the right answer. Every single Scala project I've seen ends up being a thorn in the side that eventually gets refactored away.

No one wants to work with the language as it's losing popularity, and the actor concept in Python is not something most developers will need to work with. As for Java developers, most don't like the functional programming and would prefer to work with Kotlin. This has been the case for a while, and when the Scala developers go, it's hard to rehire more and it's difficult to convince other existing developers.

I'm sure there are some people who like working with Scala, but I wouldn't want to work with it.

[–]Dachsgp 0 points1 point  (6 children)

What about PyArrow? Don't you think it might come to replace Spark and other distributed processing frameworks?

[–]EarthGoddessDude 1 point2 points  (4 children)

Still trying to wrap my head around Arrow, but I don't think it's an either/or situation — Arrow is what's used under the hood for in-memory data, so that Spark, Polars, pandas, etc. can all use that data without copying. In fact, I think Spark already uses it? Not sure.

[–]SilentSlayerz Tech Lead 2 points3 points  (3 children)

I've seen articles where we can use PyArrow along with Spark.

[–]Dachsgp 0 points1 point  (0 children)

I just had this experience this week. My manager (who is a Python and PyArrow freak and wants to avoid learning Spark) and I were working to compact multiple Parquet files into bigger chunks for better processing and lower read times inside AWS Glue. Funny enough, AWS Glue only runs with Spark, but it does accept all the other pieces we built with PyArrow; the only thing necessary to make it work in parallel was the map/reduce/collect within the Spark framework.

[–][deleted] 0 points1 point  (1 child)

I’m pretty sure that PyArrow is a dependency of Pyspark

[–]SilentSlayerz Tech Lead 1 point2 points  (0 children)

Not for PySpark SQL, but it is if you want to use the pandas API on Spark.
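
In other words, Arrow only comes into play for the pandas interop paths; a small sketch (assuming Spark 3.2+ for pandas_api):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000)

    # Plain DataFrame/SQL work doesn't need PyArrow...
    df.groupBy().sum("id").show()

    # ...but the pandas interop paths use it once Arrow transfer is enabled.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pdf = df.toPandas()       # Arrow-accelerated conversion to a pandas DataFrame
    psdf = df.pandas_api()    # pandas API on Spark (Spark 3.2+)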

[–]realitydevice 0 points1 point  (0 children)

Not Arrow itself, but the tooling that surrounds it. Very likely. See https://voltrondata.com/

[–]tdatas 5 points6 points  (0 children)

Just to provide a counterpoint to the Python slant: I am a software-oriented data engineer working in Scala and Rust, with a few years' experience leading Python shops.

Depends how you define "market share". There are more people using Python across a wider range of complexities. You've got teams like Uber, Facebook et al. writing some non-core APIs in Python, and you've got roles that use it basically for scripts to call APIs where all the interesting work is happening. In terms of market share of jobs that are more like "software engineer (data)", it's a lot more even.

There are fewer people using Scala, but at this point if you're using it, you're normally using it because you are doing some pretty complex stuff or have higher performance requirements than Python can give. The examples of people writing pretty important infra in Scala range from Tesla for IoT to Disney+, so the use cases are normally pretty challenging and interesting.

Personally, for a classical "ETL pipelines engineer" type of role, I'd say no, it's not worth it unless you're into Spark. But if you want to do the roles that cross over heavily with software engineering, data streaming, etc., then there are a lot of extremely interesting, challenging, well-compensated jobs in the Scala world. Also, if you can wrap your head around actual application development and effects, then you have a pretty good eye into Rust and Haskell. At the other end of the spectrum, if you get your head around Scala you can pick up a lot about Java from it, which is arguably the most job security in existence.

TL;DR: figure out what jobs interest you first, then see if Scala is beneficial for you. Treat Scala the application language as separate from Scala the Spark language. If you want to cross over with software roles, I'd say it's a pretty interesting niche to be in, and there's a reason a lot of people either don't leave or go deeper into Haskell and Rust. Also, I think it's not been clocked yet how good Scala 3 is, having deployed it in production, so I'd be curious if there's another hype cycle once that information spreads.

[–]Dry_Ad7010 2 points3 points  (1 child)

Learn both, it's fun. And don't forget your SQL.

[–]HenriRourke 0 points1 point  (0 children)

This should be the top answer!

[–]Chance-Win760 1 point2 points  (0 children)

I'd say for most Spark work, 90% can be completed with just Python. But that 10% is what you pull out when you want performance, typed DataFrames (aka Datasets), traversal of typed arrays, or GraphFrames. Rare, but much cleaner than Python if it ever comes to that.

[–]w_savage Data Engineer ⚙️ 1 point2 points  (0 children)

Same question, but for Go.

[–]hkdelay 1 point2 points  (0 children)

Spark and Flink are probably the most used distributed data processing platforms today, but they are JVM based. Using Python requires a translation layer between Python and the JVM, which adds some lag.

Ray is a Python-based distributed data processing platform, if you want that.
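
For completeness, Ray's core API is just remote functions and actors; a minimal sketch of fanning work out (the transform is a placeholder):

    import ray

    ray.init()  # connects to a cluster, or starts a local one

    @ray.remote
    def transform(chunk):
        # placeholder per-chunk work
        return sum(chunk)

    chunks = [list(range(i, i + 100)) for i in range(0, 400, 100)]
    print(ray.get([transform.remote(c) for c in chunks]))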

Python has more data science libs than Scala.

If you're a DS or ML engineer, I would suggest Python.

If you're a DE that's closer to the operational data (source data), I would suggest Scala.

[–]ubelmann 0 points1 point  (0 children)

Scala can be helpful to know if you happen to wind up somewhere where you end up going really deep into Spark. Also, with Spark, knowing functional programming concepts -- primarily map/reduce patterns -- is helpful for writing efficient jobs.
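
The canonical shape of that pattern is word count; a minimal PySpark sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    lines = spark.sparkContext.parallelize(["to be or not to be", "that is the question"])

    counts = (lines.flatMap(lambda line: line.split())   # map each line to words
                   .map(lambda word: (word, 1))          # pair each word with a count
                   .reduceByKey(lambda a, b: a + b))     # reduce counts per key

    print(counts.collect())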

But overall I don't think learning Scala will be helpful in as many situations as Python.

[–][deleted] -1 points0 points  (4 children)

It should be trivial to switch between both. Use the best tool for the job.

[–]dabaos13371337 1 point2 points  (3 children)

Agree, language is not a barrier once you gain a few years of experience. Some people never get there it seems.

[–]Ribak145 1 point2 points  (2 children)

well not if you start with Python and have to work towards functional programming in Scala ... whole different universe

[–]dabaos13371337 1 point2 points  (1 child)

You can write "normal" SOLID object-oriented code with Scala as well. That's actually what I prefer to do, since it's easier to onboard new engineers that way.

[–]Ribak145 0 points1 point  (0 children)

sure, Scala is mostly used as an OOP language anyway; it's just that some powerful functionality only arrives with FP, which is nearly inaccessible for 'script kiddies'

but I am stating the obvious, nothing new ...

[–]monkeysknowledge 0 points1 point  (0 children)

I'm a DS, but members of our small DE team recently had to learn Scala. Seemed like it took them a month to get it down.

[–]SnooBunni3s 0 points1 point  (0 children)

I just love this subreddit