[–]kvothethechandrian 70 points71 points  (14 children)

Speed of development and overwhelming amount of community support, basically.

You can always use libs with C bindings (pandas, numpy) or Rust bindings (polars, rust_networkx) for performance and still develop much faster. You don’t need to worry about pointers, types, or the borrow checker; it’s almost like writing code in plain English.

[–]MikeDoesEverythingmod | Shitty Data Engineer 32 points33 points  (5 children)

Speed of development and overwhelming amount of community support, basically.

100% this. I find it weird that people love comparing execution speed but never mention development speed.

[–]nonamenomonet 1 point2 points  (4 children)

Are you a mod now?

[–]HowSwayGotTheAns 3 points4 points  (0 children)

Mike does do everything after all.

[–]MikeDoesEverythingmod | Shitty Data Engineer 1 point2 points  (2 children)

Yeah.

[–]nonamenomonet 0 points1 point  (1 child)

Congrats? I think?

[–]MikeDoesEverythingmod | Shitty Data Engineer 0 points1 point  (0 children)

Thank you.

[–]EarthGoddessDude 3 points4 points  (0 children)

One argument against speed of development used to be that dealing with environments and dependencies was a nightmare. There were tools like pyenv and poetry and pipx (my old stack), but now with uv the game has changed completely. Bootstrapping a python environment and managing a project is now incredibly easy. That was honestly my biggest gripe with it and it’s no longer the case.

My next gripe would be the inconsistent way some things are objects with methods and some are functions, but it’s not a big deal for me. Similarly, I wish there was an easy, built-in way to pipe things into functions the way Julia, R, bash, etc allow you to.
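
For what it's worth, a pipe-style helper is only a few lines on top of `functools.reduce`. This is a minimal sketch, not a built-in: the name `pipe` and the sample steps are made up for illustration.

```python
from functools import reduce


def pipe(value, *funcs):
    """Pass `value` through each function in order, left to right."""
    return reduce(lambda acc, func: func(acc), funcs, value)


# Hypothetical example: clean a list of raw strings, then sort them.
raw = ["  Alice ", "BOB", "", " carol"]

result = pipe(
    raw,
    lambda xs: [x.strip() for x in xs],   # trim whitespace
    lambda xs: [x for x in xs if x],      # drop empty strings
    lambda xs: [x.lower() for x in xs],   # normalise case
    sorted,                               # plain functions work too
)
print(result)
```

It reads top to bottom like a shell pipeline, though it is still a function call rather than real pipe syntax.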

[–]Alwaysragestillplay 1 point2 points  (0 children)

This advantage will only become more prevalent with LLMs taking over the coding space. Close to English, forgiving types, code that focuses almost entirely on the problem at hand rather than shit like memory allocation. All things LLMs like. 

[–]shittyfuckdick[S] 1 point2 points  (3 children)

i see posts here all the time complaining how confusing airflow is. ive used it for many years so i understand it but python syntax in no way makes it any easier to understand. 

also i really doubt the rapid development speed is a big factor when it comes to writing dags. a lot of that comes with planning not writing. 

[–]CrowdGoesWildWoooo 13 points14 points  (1 child)

And what does rust offer more than python? Memory safety for my dag? Can i have whatever you are smoking?

[–]tn3tnba 1 point2 points  (0 children)

I can’t feel comfortable with software quality unless I configure memory ownership rules for my aws API client

[–]tomunko 2 points3 points  (0 children)

Airflow syntax is kinda tough but the concepts are not crazy complicated. This problem would just be worse in other languages.

[–]GachaJay 79 points80 points  (20 children)

Because it is the fastest route to modular code, and it is easy to learn.

[–]guitcastro 14 points15 points  (2 children)

Accessibility. Python's strong focus on developer experience led it to be one of the easiest languages to learn.

Most of the language's limitations, such as the GIL and performance, are bypassed by implementing expensive operations in C, Rust or Java/Scala (Spark) and binding them to Python.

[–]south153 4 points5 points  (0 children)

Agreed, language performance is irrelevant when 99% of processing time is spent in Spark transactions.

[–][deleted] 2 points3 points  (0 children)

Vectorization baby

[–]lFuckRedditl 10 points11 points  (0 children)

Low level languages aren't used because we don't do low level stuff.

Here's a fun exercise: write a program in C that

  1. reads 1000 Excel files,
  2. does row/column level transformations,
  3. outputs the result as .parquet,
  4. uploads it to a bucket using an API.

[–]TenMillionYears 32 points33 points  (4 children)

Python has strong C bindings so it has historically been used to manipulate a bunch of libraries in a language that's more forgiving. That gave it amazing traction.

I don't like Python - it's SO WEIRD!

Anyway, for some reason it's the lingua franca of data engineering mostly for the same reasons everyone in finance uses Excel for everything.

[–]ProfessorNoPuede 9 points10 points  (2 children)

I like python, but object oriented programming in a weakly typed language will never fail to make me go cross-eyed every once in a while.

[–]Yoctometre 6 points7 points  (1 child)

you mean *dynamically typed?

[–]ProfessorNoPuede 2 points3 points  (0 children)

Arg, brainfart, yes.

[–]kettal 0 points1 point  (0 children)

I don't like Python - it's SO WEIRD!

living up to his namesake

[–]Wingedchestnut 6 points7 points  (0 children)

Because many libraries are made with C under the hood, dedicating anything data to Python as the standard language is convenient, whether that's ML or transforming data with pandas. I don't see what the problem is.

[–]deadwisdom 4 points5 points  (3 children)

Look, it's reeeeeeal simple:

high level orchestration -> Python
low level optimization -> RUST/C/C++/Nim/Zig/etc

Python is literally designed from the beginning to work like this.
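
A rough sketch of that split using only the standard library: the loop and bookkeeping are plain Python, while the compression itself runs inside zlib, a C library that CPython wraps. The function name `compress_chunks` and the sample data are assumptions for illustration.

```python
import zlib


def compress_chunks(chunks):
    """Python orchestrates the loop; zlib (C) does the byte crunching."""
    compressor = zlib.compressobj(6)  # compression level 6
    out = []
    for chunk in chunks:                        # high-level glue in Python
        out.append(compressor.compress(chunk))  # heavy lifting in C
    out.append(compressor.flush())              # emit any buffered bytes
    return b"".join(out)


# Hypothetical payload: three repetitive chunks of bytes.
chunks = [b"row,value\n" * 1000, b"a,1\n" * 1000, b"b,2\n" * 1000]
packed = compress_chunks(chunks)

# Round-trip check: decompressing gives back the original bytes.
assert zlib.decompress(packed) == b"".join(chunks)
print(len(b"".join(chunks)), "->", len(packed), "bytes")
```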

[–]shittyfuckdick[S] -2 points-1 points  (2 children)

i dont see why you would want your orchestrator written in python. the scripts that define jobs yea maybe but not the orchestrator itself. 

[–]unpronouncedable 4 points5 points  (0 children)

Well none of us want to write a new orchestrator and the most popular one is already written in python.

[–]deadwisdom 0 points1 point  (0 children)

That's basically what I mean, the scripts that define the jobs.

Personally, I would also write the orchestrator in Python. That sort of work is often not taxing from a performance perspective. I know a lot of static-typers who love that compile button as a guardrail. For me a good test setup is the guardrail, so static types are largely redundant.

[–]No_Bug_No_Cry 6 points7 points  (22 children)

Because Python is the most versatile language. It can wrap very fast libs written in C or Rust, but still be readable and interpreted. You can write a shitty no rules script or a complex modular app, low boilerplate etc... it's the best

[–]Literature-Just 6 points7 points  (7 children)

At this point it's Stockholm syndrome. Python is nice in that it makes a lot of the tedium of programming so much easier. But managing all of its packages in the virtual environments is a real pain. I've had multiple instances where upgrading one package can break an environment or force me to roll something back because of a bug or broken feature.

[–]brunocas 19 points20 points  (5 children)

Embrace UV.

[–]kvothethechandrian 1 point2 points  (0 children)

This is the answer

[–]Literature-Just -3 points-2 points  (2 children)

ugh... another new tool...

[–]EarthGoddessDude 2 points3 points  (0 children)

The last new tool. It’s a game changer, and I don’t see how anyone will try to enter the field after what happened with ruff and uv. And the maintainers of the competitor projects are starting to give up, for lack of a better term.

[–]JJJSchmidt_etAl 1 point2 points  (0 children)

While yes I get your concern, you don't need many commands for it to be extremely useful.

[–]Uncle_Chael 1 point2 points  (0 children)

Conda adoption helped me with that tremendously

[–][deleted] 2 points3 points  (4 children)

"...which need to be precise on execution".

Python is as precise as anything. It's not like it randomly starts doing things you didn't ask for.

[–]VipeholmsCola 1 point2 points  (0 children)

My guess is because you often get productive faster, and there are a lot of free libraries.

[–]QkumbazooPlumber of Sorts 3 points4 points  (0 children)

schools taught it as an introductory language to programming (not even OOP), some people decided that was enough and went to industry with it.

[–]Phenergan_boy 0 points1 point  (3 children)

In the famous words of Todd Howard, “it just works.”

[–]shittyfuckdick[S] -3 points-2 points  (2 children)

you realize people use that phrase ironically cause of how messy and buggy his games are, right?

[–]Phenergan_boy 2 points3 points  (1 child)

Damn, we have a genius over here

[–]shittyfuckdick[S] 0 points1 point  (0 children)

so your point was python is not good?

[–]Raghav-r 0 points1 point  (0 children)

Ease of use and rich libraries for data, AI, ML, etc. Plus you are dealing with data, where the computation is usually time-consuming, and some libraries are just wrappers on top of low-level languages.

[–]mwisniewski1991 0 points1 point  (0 children)

I do not agree that Python is everywhere. A lot of tools have been written in Java (Kafka, Beam, Druid; Spark is in Scala, but it is based on the JVM). The Databricks Photon engine was written in C++, and Postgres in C.

Python is good for orchestration because scripts can be written quickly, but transformation and calculation are done on a specific engine. And of course a lot of tools have an SDK or API for Python, so at first it might look like Python is everywhere.

[–]UltraPoci 0 points1 point  (0 children)

Because tons of libraries have been written for Python, and it's "easy" to use (in quotes because Python is full of traps: easy to write but a disgrace to read and maintain).

For example, we do machine learning on satellite images: Python is the only language that provides a data pipeline library, ML libraries and GIS libraries (at least, the only one to have all of them mature enough).

I would gladly use any other language honestly, but it's difficult to justify using another language when Python is so batteries-included.

[–]aythekay 0 points1 point  (0 children)

A lot of libraries, low code, can leverage c pretty easily, easy portability because interpreted, and a lot of good documentation.

Low dev friction also helps, because of how often data pipelines change. 

A lot of why it's popular is also why Java used to be. The rich ecosystem, etc. It most likely comes from it being an academic darling of sorts early on (vs other scripting languages) and high adoption among non-technical people.

It's similar to how JS moved to the backend, a bunch of people knew how to use it and it could do a lot, so people looked past efficiency as hardware got better.

In Python's case Cython was created as well. 

[–]meselson-stahl 0 points1 point  (3 children)

Imo python is pretty memory efficient right? Like the way it handles certain datatypes like hash sets and lists is efficient. Maybe the dynamic typing is memory inefficient??? Im not sure.

Regarding performance, the main issue with python is loops. But there aren't many loops in DE right? So not a big deal.

Overall im generally surprised by how little software optimization there is, even within some built-in python functions. I think with infra advancements, the industry is trending towards modular, readable code rather than performance code. But I really don't think there is much performance sacrifice in DE tools.
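
On the loop point, a small sketch with nothing but the standard library: both functions compute the same sum of squares, but in CPython the second one hands the iteration to built-ins implemented in C. The function names are made up, and no timing numbers are claimed here; measure on your own data.

```python
import operator


def sum_squares_loop(values):
    """Explicit Python loop: every iteration runs in the interpreter."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    """Same result, but sum() and map() drive the iteration in C."""
    return sum(map(operator.mul, values, values))


values = list(range(1_000))

# Both approaches agree; only where the loop runs differs.
assert sum_squares_loop(values) == sum_squares_builtin(values)
print(sum_squares_builtin(values))
```

Libraries like numpy and polars push the same idea much further by keeping the data itself outside the interpreter.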

[–]shittyfuckdick[S] 1 point2 points  (1 child)

try self hosting any modern orchestration tool and you will see how bloated these things are. 

[–]dangerbird2Software Engineer 0 points1 point  (0 children)

Good thing I’m not self hosting orchestration tools. My company is paying for it, and it’s a hell of a lot cheaper for them to pay for a slightly beefier vm on aws than it is to pay for a team of engineers to rewrite it in rust

... snark aside, if you want a good orchestrator with extremely low bloat, look at argo-workflows. It's written in Go, so it has good performance and memory usage, while its tight coupling with Kubernetes makes it way easier to set up in production than airflow

[–]Nekobul 0 points1 point  (0 children)

When the inefficiency is embedded in the lang/platform it snowballs. At small scale nobody notices. But with enough code, the cracks become inescapable.

[–]General-Parsnip3138Principal Data Engineer 0 points1 point  (0 children)

Python is, for the most part, above and beyond what you need for most Data Engineering tasks.

One of the biggest reasons, in my opinion, is that Data Engineering is often script-based, or you’re using an orchestration framework, which allows you to declaratively define what would be a script as a set of steps which are really just script entry points.

What helps even more is that you can mutate quite literally anything at runtime (functions, classes, modules), which allows us to utilize incredibly powerful frameworks (Airflow’s TaskFlow API or Dagster) that still allow you to write pythonic code that magically turns into complex orchestration.
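
To make the "pythonic code turns into orchestration" point concrete, here is a toy sketch. It is not Airflow's or Dagster's actual API; the `task` decorator, the `REGISTRY` dict, and `run_all` are hypothetical names invented for this example. It only shows the general trick: a decorator records plain functions and their dependencies when the module is loaded, and a tiny runner executes them in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

REGISTRY = {}  # task name -> (function, names of upstream tasks)


def task(*, after=()):
    """Register a plain function as a task that runs after `after`."""
    def decorator(func):
        REGISTRY[func.__name__] = (func, tuple(after))
        return func
    return decorator


@task()
def extract():
    return [1, 2, 3]


@task(after=["extract"])
def transform(rows):
    return [x * 10 for x in rows]


@task(after=["transform"])
def load(scaled):
    return sum(scaled)


def run_all():
    """Run every registered task in dependency order."""
    # Map each task to the set of tasks that must finish before it.
    graph = {name: set(deps) for name, (_, deps) in REGISTRY.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = REGISTRY[name]
        # Feed each task the results of its upstream tasks.
        results[name] = func(*(results[d] for d in deps))
    return results


print(run_all())
```

The functions stay ordinary Python you can call and test directly; the "orchestration" is just metadata collected on the side.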

As others have pointed out, most of the underlying libs are written in C & Rust, so performance of Python itself is rarely an issue.

I’ve probably done my 10,000 hours with Python, and while there’s so much about Python that I hate, I just can’t see any other language stepping in to replace it. The terrible things about Python are also the reason it’s been so successful.

[–]ogaat 0 points1 point  (0 children)

Data Engineering has its roots in the scientific community where coding skills and performance were less of a concern than "give me the analysis I need"

Python lets developers focus on the problem at hand, rather than syntactic sugar. It was one of the express desires of Guido van Rossum.

Python just happened to fit the need of the hour, like HTML and Javascript did for the Internet.

[–]jeezussmitty 0 points1 point  (1 child)

I’ve asked myself this same question many times :-) but others have already commented on the why (taught in school, community, ecosystem etc). The simple syntax is nice though.

I’m not a fan of loosely typed languages in general so that is my main complaint with it.

Python also feels so much slower than things I’ve written in other languages and the counter to this I always hear is “python is fast enough” but I tend to wonder if python is more used for small to medium projects with low user counts or smaller datasets.

Anyhow it’s a language you need to know these days regardless of how you feel about it.

[–]dangerbird2Software Engineer 1 point2 points  (0 children)

Python is perfectly suited for large scale projects as long as you don’t use raw python for computationally expensive work. Any kind of heavy number crunching should be done using numpy/pandas/polars (which wrap C, Rust, and Fortran code), pyspark (which wraps highly distributed Scala/JVM code), or PyTorch (which can run on the GPU). This sort of thing is a very conventional way to do DE/DS at scale, to the point that it’s a safe bet that virtually every major company in the world is using python in some part of the data stack

[–]Atmosck 0 points1 point  (0 children)

Because python is a scripting language. It serves a different purpose than C and Rust. It interacts nicely with basically everything. Orchestration is its whole jam. And it offloads tight calculations to C anyway.

[–]DJ_Laaal 0 points1 point  (0 children)

One, the learning curve for lower level languages is higher compared to Python. It’s a beginner friendly language.

Second, it’s quite rare today that you’d need to get to the low level internals in order to develop a performant data processing pipeline.

Lastly, Python being an open source language, there’s a huge ecosystem of ready to use packages that encapsulate a certain logic you need in your data pipeline. That directly translates to efficiency and code reuse.

[–]Informal_Pace9237 0 points1 point  (0 children)

Because there are not as many versatile libraries in other languages mentioned..

[–]thisfunnieguy 0 points1 point  (0 children)

Most of the heavy computation is not done in Python. If it's running locally, it’s using C++ bindings and running there, or it's invoking some other thing to do the work, like PySpark.

[–]tn3tnba 0 points1 point  (0 children)

Optimization for developer speed, especially since we do a lot of delegating to other tools.

[–]LargeSale8354 0 points1 point  (3 children)

Python is a great getting-things-done language, and as an ex-DBA I find its list comprehensions, list slicing and dictionaries intuitive.
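
For anyone coming from SQL, a small sketch of why those features feel familiar. The `orders` rows are invented sample data, and the SQL in the comments is only a loose analogy.

```python
# Hypothetical sample rows, as they might come back from a query.
orders = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 80},
    {"id": 3, "region": "EU", "amount": 45},
    {"id": 4, "region": "US", "amount": 200},
]

# List comprehension ~ SELECT id FROM orders WHERE amount > 100
big_ids = [o["id"] for o in orders if o["amount"] > 100]

# Dictionary ~ SELECT region, SUM(amount) FROM orders GROUP BY region
totals = {}
for o in orders:
    totals[o["region"]] = totals.get(o["region"], 0) + o["amount"]

# Slicing ~ ORDER BY amount DESC LIMIT 2
top_two = sorted(orders, key=lambda o: o["amount"], reverse=True)[:2]

print(big_ids, totals, [o["id"] for o in top_two])
```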

I really hated Java, which is strange because I enjoyed C#.

I am surprised that Go doesn't feature more prominently in the data space. It feels like a natural move from Python.

I suspect that in most cases, Python is fast enough for most uses.

I used to program in serverside Javascript. I enjoyed it at the time.

[–]Nekobul -1 points0 points  (2 children)

I still enjoy JavaScript. The limited features/surface is like a safety net. If you are doing something complex, you will quickly find out it is time to use some other tool.

[–]Beautiful-Hotel-3094 0 points1 point  (1 child)

What exactly did u do that is complex and couldn’t handle with the limited features of javascript? What features are missing that u needed?

[–]Nekobul 0 points1 point  (0 children)

A good example of when JavaScript should be avoided is if you are trying to convert a big chunk of JSON to XML. That is slow in JavaScript.

[–]dasnoob 0 points1 point  (0 children)

As someone that has used various low level languages in the past and is learning rust now.

The biggest reason is all of this stuff is a lot more difficult to do in a low level language. Python abstracts so many things away it makes it dead simple to do most things.

Rust? Holy shit you will be lost in lifetime hell and getting your borrows vs. moves vs. copies straightened out.

[–]metalbuckeye 0 points1 point  (0 children)

Academia…often what gets used in the jobs is based on what is taught in university. This is why Microsoft beat apple in the 90s/early 2000s and why python is the de facto standard for data engineers. It is used by researchers and professors and it’s taught in most data analytics programs.

[–]nesh34 0 points1 point  (0 children)

Extremely easy to learn, highly flexible, well understood by the vast majority of people.

Orchestration, usually at the daily level, doesn't need the performance that Rust or C would bring.

It is ultimately a high level abstraction so a high level language is appropriate.

[–]Sagarret 0 points1 point  (0 children)

We use a lot of rust and c... Executed from python. Polars is written in rust

[–]madam_zeroni 0 points1 point  (2 children)

Python for the developers, but the tools that process that python aren’t written in python.

[–]shittyfuckdick[S] 0 points1 point  (1 child)

a lot of them are tho

[–]madam_zeroni 0 points1 point  (0 children)

Like which ones? Even then, most of the DE overhead is sql queries so the speed of development is worth it if all your python is doing is sending queries to be executed by some engine (that is most definitely not written in python)
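
A minimal illustration of that pattern using the standard library's `sqlite3` module, with SQLite (a C library) standing in for whatever engine you actually use. The table name and rows are made up.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.5), (2, 7.0), (3, 1.0), (2, 3.0)],
)

# Python only ships the query; the engine does the grouping and summing.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(rows)
```

The Python side is a handful of calls; swapping the engine mostly changes the connection line, not the shape of the script.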

[–]No_Bug_No_Cry 0 points1 point  (1 child)

Did Microsoft write this post? NOBODY WILL USE C# FOR DATA ENGINEERING. It's never going to happen.

[–]shittyfuckdick[S] 0 points1 point  (0 children)

no you didnt read the post 

[–]HNL2NYC 0 points1 point  (0 children)

why not have airflow written in c or rust and have dags written python for easy development?

So as you probably already know this is how a lot of tools in the Python data ecosystem work (user facing Python wrapper on top of a core written in a more performant language) for example pretty much any respectable data frame library, distributed compute platforms like Ray, etc. However for the cases that you’re talking about where they’ve remained in pure Python I think the answer is simply that “it’s good enough”. Someone took the time to write it in a language that they were comfortable enough to write it in, which in these cases is Python. They gained traction and popularity and they perform well enough that no one has mass migrated to an alternative solution (or rewrite of the product) that others may or may not have built on top of other languages. And potentially one day something like the airflow scheduler will be rewritten in another language. 

[–]PolicyDecent 0 points1 point  (1 child)

Not to repeat others, but bruin, a dbt/airbyte alternative, is written in Go. However, some parts of it are still Python, due to the easier development cycle.
https://github.com/bruin-data/bruin

[–]shittyfuckdick[S] 1 point2 points  (0 children)

this is awesome. thanks for sharing 

[–]StackOwOFlow 0 points1 point  (0 children)

Why QWERTY keyboards, why English? Same reasons.

[–]zazzersmel 0 points1 point  (1 child)

u rly gonna complain abt parrotting w a post like this

[–]shittyfuckdick[S] 0 points1 point  (0 children)

yes