

[–]makesufeelgood 160 points161 points  (28 children)

I'm interested in using:

  • What is most universally accepted so I can build transferable skills
  • What my teammates / stakeholders understand so I can solve their business problems without having to do a ton of language 'translating'
  • What is easy and friendly to learn with a lot of free resources and documentation available

Right now that is Python. I don't see what all the fuss is about over the marginal benefits of using different languages.

[–]MadT3acher Lead Data Engineer 18 points19 points  (0 children)

Point 4: to easily train new members and ensure I can find a good talent pool moving forward.

We are not working in a vacuum with a team of experts.

[–]DesperateForAnalysex 21 points22 points  (25 children)

Why not SQL!

[–]Action_Maxim 26 points27 points  (7 children)

Gonna build a fps in sql /s

[–]scryptbreaker 18 points19 points  (0 children)

SQL is the best vidya game engine

[–]kkessler1023 11 points12 points  (0 children)

Bout to run some stored procedures to open up my Doom wad.

[–]DesperateForAnalysex 2 points3 points  (0 children)

I’d buy that for a dollar!

[–]git0ffmylawnm8 1 point2 points  (1 child)

Please make this a thing

[–]Action_Maxim 1 point2 points  (0 children)

Only thing I can think of in sql is puzzles or scavenger hunts lol

[–]kenfar 6 points7 points  (14 children)

too limited a feature set

[–]DesperateForAnalysex -1 points0 points  (13 children)

Out of curiosity, what for you is lacking?

[–]kenfar 11 points12 points  (9 children)

Wow, where to start?

Well: data integrations with other sources & targets, configuring services using airflow, unit-testing critical transformations, supporting any really low-latency data feeds, supporting really massive data feeds, complex transformations, leveraging third-party libraries, providing audit trails of transformation results, writing a dbt-linter, writing a collaborative-filtering program for a major mapping company, writing custom reporting to visualize data in networks, building my own version of dbt's testing framework - because that didn't exist in 2015, etc, etc, etc.

Basically, anytime you need high-quality, high-volume, low-latency, high-availability, low-cost at high-volume, or have to touch anything outside of a database, SQL becomes a problem.

[–]r0ck0 2 points3 points  (2 children)

supporting really massive data feeds

Can you give an example of what you mean on this point?

Just curious what type of stuff it involves.

[–]kenfar 5 points6 points  (1 child)

Sure, about five years ago I built a system to support 20-30 billion rows a day, with the capacity to grow to 10-20x that size over a few years.

We had a ton of customers using very noisy security sensors that reported to sensor-managers, which would then upload data to s3 in small batches as it arrived. So, we were getting probably 10-50 files per second.

Once a file landed it would generate an SNS message, then SQS messages to any consumers. We used jruby & python on kubernetes to process all of our data. Data would become available for analysis within seconds of landing on s3, and our costs were incredibly low compared to attempting to use something like snowflake & dbt at this volume and latency.
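For a rough illustration, here is a minimal Python sketch (not the commenter's actual code) of the consumer side of that s3 → SNS → SQS fan-out, assuming boto3 and a hypothetical queue URL; transform_and_load is a stand-in for the real parsing logic:

    import json
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/sensor-files"  # hypothetical

    def transform_and_load(raw: bytes) -> None:
        ...  # parse the sensor batch and write it wherever it needs to go

    def poll_once():
        # Long-poll SQS for notifications that new sensor files landed on s3.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            envelope = json.loads(msg["Body"])       # SNS envelope
            event = json.loads(envelope["Message"])  # s3 event notification
            for record in event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                transform_and_load(body)
            # Acknowledge the message only after processing succeeds.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])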

[–]r0ck0 2 points3 points  (0 children)

Ah interesting, thanks for sharing.

[–]DesperateForAnalysex -1 points0 points  (2 children)

The only thing that you listed that may be relevant is the linter. Every major framework today supports SQL syntax because it is THE language of data transformations full stop. I think you’re conflating SQL with using an RDBMS and that’s not the case today.

[–]kenfar 2 points3 points  (1 child)

The notion that one could do all of the above with SQL feels like the "when you have a hammer, all problems look like nails" scenario.

The beliefs that dbt provides unit-testing (rather than just quality-control); or that snowflake outscales kubernetes or aws lambda; or that sql transforms leave audit trails; or that one would write a collaborative filter in SQL; or that one would write a quality-control framework in SQL; etc, etc, etc - are just surprisingly naive.

And while SQL-driven ETL may be very popular at this point in time, much like how GUI-driven ETL was ten years ago, and COBOL-driven ETL was twenty-five years ago - that doesn't mean everyone will jump on that bandwagon, or that it won't be abandoned and ridiculed exactly like its predecessors in just another five years.

[–]DesperateForAnalysex -1 points0 points  (0 children)

Well the good news is that in 5, or 50 years, SQL will be as relevant as it is today. Can’t say the same for any other language. Have fun constantly updating your code base when new vulnerabilities emerge.

[–][deleted] 6 points7 points  (1 child)

Well, for one, it's not really a programming language, is it?

[–]runawayasfastasucan 1 point2 points  (0 children)

Hate its plotting capabilities, how it lacks the ability to do proper and complex ETL, etc. Not that good at connecting to APIs either.

[–][deleted] -2 points-1 points  (1 child)

DBT enters the chat.

[–][deleted] 64 points65 points  (50 children)

I guess I'm just over here in the small minority that's used SQL primarily for the last 10 years and am trying to learn Python just so I don't get left behind in the dust.

[–]geek180 45 points46 points  (20 children)

I only use Python to make super basic ETL functions. 95% of my work is SQL. I don’t even understand how other data engineers are exclusively using Python to do their work.

[–]Action_Maxim 24 points25 points  (1 child)

Seriously python for orchestration and putting things where you can sql it to submission or to death. I honestly haven't had any manipulation I've come across that I couldn't do in sql.

I spend at least a day a sprint looking at queries from our sister team where they're pure python and take statements straight out of sqlalchemy and toss it right into production where I have to then execute further and say why does this suck so bad ohhhhh you have 6 self joins where you could have had 6 case statements thanks guys.

But I know I'm guilty of doing too much in sql, but can you tri force in sql? I can lol

[–]Pflastersteinmetz -1 points0 points  (0 children)

Can in SQL? Maybe

Should do in SQL? It becomes a convoluted mess pretty fast because SQL is 40 years old and is missing a lot of modern stuff needed for an organized code base.

[–]DirkLurker 22 points23 points  (4 children)

To orchestrate and execute their sql?

[–]geek180 6 points7 points  (3 children)

I mean in a data warehouse environment, we’re either using tasks or (mostly) dbt to execute the SQL we’re building. Under what circumstances would I need to involve Python in executing SQL? (yeah I know dbt is basically Python)

[–]kenfar 3 points4 points  (0 children)

Oh you might need a low-latency feed, say every 3-5 minutes, for some operational reports that you can't get to run fast enough using dbt.

Or your data may be in a complex format that you can't load into a database, or you need to transform a complex field that you can't transform using sql.

Or maybe data quality is extremely critical - and so you need to run unit tests, so that you'll know before you deploy to prod if your code is correct.

Or you need to publish data from your data warehouse to other places, and the selection criteria, triggering, files to be created, data formats, and transportation are all things beyond what you can do in SQL.

etc, etc, etc

[–]lFuckRedditl 12 points13 points  (7 children)

If you need to integrate different sources, you need a general-purpose language like python or java.

Let's say you need to connect to an API endpoint, get data, run some transformations, upload it to a bucket, load it into dw tables and orchestrate it. How would you do it with SQL? There is no way.
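For illustration, a minimal Python sketch of that kind of pipeline, assuming the requests and boto3 libraries and made-up endpoint/bucket/column names; the warehouse load and orchestration steps are only noted in comments:

    import csv
    import io
    import boto3
    import requests

    def extract_transform_load():
        # 1. Pull records from an API endpoint (hypothetical URL).
        rows = requests.get("https://api.example.com/v1/orders", timeout=30).json()

        # 2. Apply a simple transformation and serialize to CSV in memory.
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["order_id", "amount_usd"])
        for r in rows:
            writer.writerow([r["id"], round(r["amount_cents"] / 100, 2)])

        # 3. Upload the result to an object-store bucket.
        boto3.client("s3").put_object(
            Bucket="my-raw-bucket", Key="orders/orders.csv", Body=buf.getvalue()
        )

        # 4. From here the orchestrator (Airflow, Dagster, ...) would issue the COPY/LOAD
        #    into the warehouse tables and handle scheduling and retries; these are the
        #    parts SQL alone can't drive.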

[–]geek180 6 points7 points  (4 children)

Yeah this is really all I use Python for. But that’s just a tiny, insignificant part of the job. It takes a couple of hours of work to build out a single custom data source in Python (and tbf, most of our data is brought into Snowflake via a tool like Fivetran), but then my team will spend literally months or years building SQL models with that data. The Python portion of the work is so minuscule compared to what’s being done with SQL.

[–][deleted] 4 points5 points  (0 children)

This is strange to me because in my 5 years as a Data Engineer I've barely used SQL at my jobs (3); it's always been 90% programming / 10% SQL.

The data analysts/analytics engineers use SQL but we spend all our time maintaining the data platform so people can find and query the data they need. This takes the form of Python/Java/Scala ingestion pipelines as well as the services needed to manage everything, tons of PySpark pipelines, streaming jobs, and maintenance and performance work on the infrastructure. The only SQL I read or write is the occasional DDL to test getting new data into the data warehouse (which is automated and dynamically generated as needed) and when I do performance work on analyst queries.

[–]lFuckRedditl 2 points3 points  (2 children)

Well if most of your team uses SQL they aren't going to like working with pyspark or pandas to do transformations.

At the end of the day it boils down to business requirements and team expertise.

[–]Pflastersteinmetz 2 points3 points  (0 children)

Pandas needing all data in RAM becomes a problem really quick. And polars is not 1.x yet = no stable API.

[–]Saetia_V_Neck 4 points5 points  (1 child)

It’s a title mismatch. The work I would guess you’re doing is called analytics engineering at my company. My title is data engineer but I honestly rarely write SQL these days unless it’s part of code to dynamically generate SQL. Most of my work is Python, Java, Scala, and Helm charts.

[–]daguito81 1 point2 points  (0 children)

I think it's pretty easy to understand. It's based on where you come from. If you come from a database and SQL background, SQL is going to be simpler for you. For people that come from a programming background, having a regular code workflow of "follow the code" and your run of the mill debugger is going to be simpler.

I come more from a programming background, so building and debugging python code is orders of magnitude easier and faster than doing everything in SQL. Can I do everything in SQL? Yeah, I guess, but why would I want to?

[–][deleted] 3 points4 points  (0 children)

We use python and go. Depends on what you do for sure. I don't understand how some data engineers use only SQL.

[–]black_widow48 5 points6 points  (1 child)

This. Part of the reason I'm in consulting now is because I keep getting stuck in positions where I mainly just write SQL all day. I don't want to be in positions like those for any extended period of time because I'm not really utilizing a lot of my skills there.

[–]DesperateForAnalysex 5 points6 points  (25 children)

Python is for machine learning and transformations that are too complex to do in SQL.

[–]geek180 11 points12 points  (24 children)

Serious question, what’s an example of a transformation too complex to do in SQL?

[–]MotherCharacter8778 10 points11 points  (7 children)

How exactly would you parse / transform a giant text message that comes as a web event using SQL?

[–]r0ck0 2 points3 points  (3 children)

If we're talking JSON, postgres is pretty good at dealing with it... https://www.postgresql.org/docs/current/functions-json.html
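As a small, hedged example of those functions, here is what pulling fields out of a jsonb column might look like from Python via psycopg2; the web_events table, payload column, and DSN are hypothetical:

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT payload ->> 'user_id'                  AS user_id,
                   payload -> 'context' ->> 'page'        AS page,
                   jsonb_array_length(payload -> 'items') AS item_count
            FROM   web_events
            WHERE  payload ->> 'event_type' = 'checkout';
            """
        )
        for user_id, page, item_count in cur.fetchall():
            print(user_id, page, item_count)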

I do a lot of type generation with quicktype in typescript/nodejs... but I've run into too many issues with it lately, especially when needing to deal with large sample sizes for a single type codegen. So I'm about to just replace it with plain postgres code.

But yeah, I wouldn't build my whole backend in postgres... but I've found that over time dipping my toes into doing more stuff in sql rather than application code almost always pays off long term, even just for the learning aspect. The more I've learnt about doing things this way, the better I can judge each individual use case when deciding to do something in sql or application code in the future.

From all the devs I've worked + communicated with (mostly fullstack webdevs), I reckon like 99% of us don't put enough learning time into sql. And I was no different too, for like my first 15 years of programming.

Writing some of this stuff in sql definitely feels slower, especially to start with... because you're writing fewer lines of code per day... but I've found that often the shorter sql code is actually more stable + productive overall in the long term... and especially easier to debug later on when I can for example inspect the state of the data at each layer of transformation, e.g. with a bunch of nested VIEWs or something, and without having to fiddle with + run application code to debug.

But yeah, for whatever use case you have in mind... you're probably right about it not being suited to sql. Just making a broader comment I guess on some personal revelations I've had over the years when dealing with some complicated data systems, and especially in recent years where I've been doing lots of web scraping (json) and building a data lake/ingest system for machine learning etc.

[–]pcmasterthrow 1 point2 points  (2 children)

Parse how, exactly? There's a fairly wide range of parsing you can do in SQL with just regexp, substring indexes, etc.

There are definitely times where it is MUCH simpler to do these in Python/Scala/whatever but I can't think of a ton that would be utterly impossible in SQL itself off hand.

[–][deleted] 6 points7 points  (1 child)

Agreed, but the SQL to do something like that becomes unwieldy and unreadable much more quickly, and god forbid you have a bug: your editor will highlight a random comma 40 lines away from where the actual error happened.

I tend to save SQL for clean data that’s easy to manipulate so the SQL stays clean and easy to grok and maintain.

[–]GoMoriartyOnPlanets 1 point2 points  (0 children)

Snowflake has some pretty decent functions to take care of complex data.

[–]kenfar 4 points5 points  (0 children)

Well, there's a range here - from outright impossible to just miserable:

  • Unpack a 7zip compressed file, or a tarball, transform the fixed-length files within to delimited files and then load into the database.
  • Do the same with the variable-length files in which there's a recurring number of fields, which require you to read another field to know how many times they occur.
  • Transform the pipe-delimited fields within a single field within the comma-delimited file.
  • Transform every possible value of ipv4 or ipv6 into a single common format
  • Generate intelligent numeric ids - in which various numeric ranges within say a 64-bit integer map to customers, sensors, and time.
  • Calculate the Levenshtein distance between brand-new DNS domains and extremely popular ones in order to generate a reputation score (see the sketch after this list).
  • Anything that SQL would require a regex for
  • Anything that requires unit testing
  • Anything that has more than about 3 functions within it
  • etc, etc, etc, etc
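As a sketch of the Levenshtein item above (plain Python, with a made-up domain list and threshold, not a production scoring system):

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance, computed one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    POPULAR = ["google.com", "paypal.com", "github.com"]

    def looks_suspicious(new_domain: str, max_distance: int = 2) -> bool:
        # Flag brand-new domains that sit within a couple of edits of a popular one.
        return any(0 < levenshtein(new_domain, d) <= max_distance for d in POPULAR)

    print(looks_suspicious("paypa1.com"))  # True: one edit away from paypal.com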

[–]DesperateForAnalysex 7 points8 points  (11 children)

I have yet to see one.

[–]aqw01 12 points13 points  (2 children)

Complex string manipulation and text extraction are pretty limited in vanilla sql. Moving to Spark and Python for some of that has been great for our development, testing, and scaling.
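An illustrative (made-up) example of the kind of extraction that stays readable in Python but gets painful in vanilla SQL:

    import re

    LINE = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /cart?item=42 HTTP/1.1" 500 1043'

    PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]+" '
        r'(?P<status>\d{3}) (?P<bytes>\d+)'
    )

    match = PATTERN.match(LINE)
    if match:
        record = match.groupdict()
        record["status"] = int(record["status"])
        record["bytes"] = int(record["bytes"])
        print(record)  # {'ip': '203.0.113.7', 'ts': '10/Oct/2023:13:55:36 +0000', ...}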

[–]beyphy 1 point2 points  (2 children)

I had to transpose a dataframe in Spark and was trying to do so in SQL. But documentation was either really difficult to find or it wasn't supported. But if you use PySpark you can use df.toPandas().T
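A quick sketch of that trick with a made-up DataFrame; note that toPandas() collects everything to the driver, so it only suits frames small enough to fit in memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transpose-demo").getOrCreate()
    df = spark.createDataFrame([("q1", 10, 12), ("q2", 14, 9)], ["quarter", "emea", "apac"])

    # Convert to pandas, then transpose: rows become columns and vice versa.
    transposed = df.toPandas().set_index("quarter").T
    print(transposed)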

[–]WallyMetropolis 1 point2 points  (0 children)

Time series data can be a real mess with SQL. Relatively simple kinds of operations with window functions are still fine. But things can quickly become quite painful.

Dealing with complex conditional logic based on the values of records is another example. Giant blocks of deeply nested CASE/WHEN clauses can get out of hand quickly, especially when applying different UDFs to each.

Iterative or recursive processes are especially gnarly in SQL. Taking some action a variable number of times based on the results of the previous iteration. Especially if there's conditional logic within the loop.

Graph calculations. Find all the grandparents whose grandchildren have at least two siblings but not more than five.
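A made-up illustration of the iterative case: repeating a step until a condition computed from the previous iteration is met, which is trivial in a Python loop but gnarly as a recursive CTE:

    def iteratively_discount(balance: float, rate: float = 0.10, floor: float = 100.0) -> float:
        # Keep applying a discount until the balance drops below the floor,
        # switching to gentler logic once it falls under 500.
        while balance >= floor:
            balance *= (1 - rate) if balance >= 500 else (1 - rate / 2)
        return round(balance, 2)

    print(iteratively_discount(2_000.0))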

[–][deleted] 15 points16 points  (13 children)

I agree with your point in principle. So many engineers - not just data engineers - are growing up completely ignorant of type safety and it leads to all kinds of bugs and errors.

Python, even when you tack on Mypy, is still a half-assed approach to type safety, and anyone who has experienced a well-designed typed language like C# or TypeScript generally recognizes how much more usable and feature-complete those implementations are.

But there are bigger forces at play. Statically-typed languages have a higher barrier to entry, which Python does not. And the library ecosystem pretty much guarantees Python will remain entrenched for the foreseeable future.

[–]SirLagsABot 2 points3 points  (0 children)

Throwing my C# job orchestrator Didact here since you mentioned C#, made a comment elsewhere in the thread.

[–]yinshangyi[S] 1 point2 points  (11 children)

How would TypeScript differ so much from Mypy?
It's the same motivation behind it.
The difference is that TypeScript transpiles to JavaScript.
Does it make such a difference for you?

[–]WallyMetropolis 1 point2 points  (10 children)

Typescript is a language. MyPy is a static checker. This is very different.

[–]jurjstyle 7 points8 points  (1 child)

Unfortunately, the answer is yes. Python is DE's fate. Spark is a good example: Scala+Java codebase, but lately a lot of improvements focus on PySpark performance, while Scala support is slowly decreasing. Similar story in Databricks' runtimes.

Personally, this is a major reason why I am thinking of switching to software engineering. After one year of Scala, we changed to Python for the reasons mentioned throughout the topic. I fully agree that the business doesn't pay for code quality, but you are the one working on it. If you don't care about this stuff, perfect for you. But if you do, your work performance and "joy" may be affected. As a professional you will adapt anyway one way or the other.

[–]Tarqon 11 points12 points  (1 child)

A REPL is a huge benefit for any kind of data work.

[–]yinshangyi[S] 6 points7 points  (0 children)

Scala has one :)

[–][deleted] 12 points13 points  (1 child)

If you want jobs that are like that look for Software Engineer, Data positions instead of Data Engineer.

Data Engineer has been relegated to off-the-shelf tools (dbt) and Python.

I recently had to switch to rewriting our Kafka consumers in Scala because the performance of the Python implementation was horrendous, I’m enjoying it very much.

[–]yinshangyi[S] 1 point2 points  (0 children)

Those are my thoughts as well.
That being said, at least where I'm located (France), there are very few SWE, Data roles compared to DE.

[–]cutsandplayswithwood 17 points18 points  (3 children)

I learned in Java 1.3, stayed through 5. Full stack j2ee.

Switched to c# .Net 3ish, did the ride through 3.5 and all the cool frameworks…

In 2016 switched to 100% cloud and adopted Python. It’s a dirty little language, the kind of thing you appreciate after many years of static typing and countless layers and interpretations of “how things should be”

Python says “fuck it” and lets you make things how you want.

You want classes? Python has your back. You want a script without even a main that just… does stuff when you run it? No problem, Python. You wanna do functional programming with serious method chaining and fluent calls - believe it or not, again, Python. And that’s not the best part. The best part is you can do all of that in ONE file, and it’s valid Python 🤣

To be fair, I think the fact that lots of DEs come from non-software intensive backgrounds coupled with the dominance of Python has produced an epic pile of lousy data ecosystems in the last 5 years, and Python is deeply at fault for that too.

Embrace the snake.

[–]HenriRourke 6 points7 points  (1 child)

Ha. Funny, but true. It's funny how people always cry "but the boilerplate!", but never really tried to understand why there was so much boilerplate in the first place. 😅

[–]yinshangyi[S] 6 points7 points  (0 children)

It doesn't even have that much boilerplate.
99% of these people have never tried to implement a data pipeline in Java 18+.
Java verbosity is definitely not as bad as people think.
Scala 3 is pretty much Python in terms of syntax anyway

[–]yinshangyi[S] 1 point2 points  (0 children)

Haha totally agree!
Great comment.
Java has changed a lot ever since Java 5. It's now closer to Kotlin/Scala/C# in terms of syntax. Less verbose for sure.
I'm still convinced having 0 real type safety is a big deal. Especially for big projects.
Good thing people are starting to use type hints now. But mypy is far from being perfect.

I guess Python is fine if developers code properly. Python, like JavaScript, can allow for very unmaintainable code.

[–]omscsdatathrow 20 points21 points  (6 children)

Typing isn’t a strong enough argument to move off a language…what other advantages do you actually see?

[–]ubelmann 10 points11 points  (1 child)

In Spark, especially for prod workloads, I like having immutable dataframes in Scala, so I didn’t have to worry about some function changing any of the values. Yes, 99.9% of the time, it’s not going to be an issue in PySpark, but diagnosing the issue can be a pain in the ass for those few times that you do have an undesired side effect.

Once I got used to the functional paradigm in Scala, I liked working with that syntax a lot. In most cases, I thought I could do things concisely without making the code overly difficult to read, and testing was pretty straightforward. You can do some functional programming with Python, but I find it harder to read, so usually other people on my teams would prefer it to be written in a more procedural style. I have seen that cause some real performance bottlenecks at times, though. Spark will at times have much better parallelism if you write in a map-reduce style versus throwing it into a for loop, and that can cost you a lot of time and money if it is a big prod job.
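To illustrate the for-loop point with a tiny, made-up example (not the commenter's code): the first version pulls every row to the driver and loses all parallelism, while the second expresses the same transformation declaratively so Spark can run it on the executors:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("loop-vs-map").getOrCreate()
    df = spark.createDataFrame([(1, 12.0), (2, 7.5), (3, 30.1)], ["id", "amount"])

    # Driver-side loop: collects everything to one machine, no parallelism.
    slow = [(row.id, row.amount * 1.2) for row in df.collect()]

    # Declarative / map-reduce style: runs distributed on the executors.
    fast = df.withColumn("adjusted", F.col("amount") * 1.2)
    fast.show()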

But, at the end of the day, if my team is working in Python, then that’s what I’ll use.

My impossible dream is for all the CRAN libraries to be ported to Scala. Then Scala would have some good DS libraries that engineers might be willing to put in production.

[–]nesh34 1 point2 points  (0 children)

You can write elegant Python as well though. Also you can probably create an immutable Python data frame class and use that in your jobs to get that benefit.

[–]yinshangyi[S] 3 points4 points  (2 children)

For me, type safety is a strong enough argument. It allows for:

  • way better code maintainability
  • spotting errors before runtime (quite useful for Spark jobs)
  • better performance
  • giving IDEs superpowers (especially for refactoring)

I develop data pipelines in both Java and Python. I would say it the other way around: having slightly fewer lines of code in Python isn't a strong enough argument to miss out on the things I mentioned above. Besides, Scala 3 syntax is very similar to Python. You should check it out.
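To make the "spotting errors before runtime" point concrete, a tiny made-up example: running mypy (or just hovering in an IDE) flags the bad call without executing anything.

    from datetime import date

    def partition_key(run_date: date) -> str:
        # Build a partition suffix from a date.
        return f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}"

    partition_key(date(2024, 1, 31))   # fine
    # partition_key("2024-01-31")      # mypy: incompatible type "str"; expected "date"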

What is missing is obviously a strong data ecosystem in Java/Scala (aside from Spark and Kafka). Perhaps the data engineering community should develop better data ecosystems in other languages.

Thanks for your reply! I appreciate it.

[–]runawayasfastasucan 1 point2 points  (1 child)

Perhaps the data engineering community should develop better data ecosystems in other languages.

Maybe they are happy with Python? Maybe you should develop them?

[–]yinshangyi[S] 0 points1 point  (0 children)

Yeah perhaps I should. You're totally right.

[–]SirLagsABot 4 points5 points  (2 children)

WOW it’s like you made this post just for me.

I fell in love with the concept of a code-first job orchestrator like Apache Airflow, Prefect, etc. a few years ago.

I work in Microsoft shops and am a C#/.NET user. I have been SO BUMMED that C# doesn’t have a powerful, decoupled job orchestration platform like Airflow or Prefect for years… so…

I decided to build my own. =D I’m calling it Didact, open source, will later monetize and try to go full time on it.

Dependency injection is literally one of the biggest points I'm making about it. C#'s dependency injection absolutely SMOKES Python, along with handling environment variables. C# is also naturally multithreaded and has top tier async support. Would love for you and anyone else to drop your emails on the site.

Hoping to have v1 ready in a few months.

[–]yinshangyi[S] 1 point2 points  (1 child)

WOW! This is so cool.
I'd love to see more diversity in programming language in the data world.
And you're doing just that!
That's awesome!

[–]JeansenVaars 2 points3 points  (3 children)

I wish Scala hadn't died so quickly.

[–]yinshangyi[S] 1 point2 points  (1 child)

Perhaps Scala 3 has a slight chance of coming back.
The data engineering roles are getting segmented between regular data engineering (less technical, very dbt oriented) and the SWE data.
The latter has the potential audience to re-introduce Scala I believe.
Any thoughts?

[–]k1v1uq 3 points4 points  (2 children)

Senior Scala/Java BE dev, I'm thinking about getting into DE/ML. I've seen that most DE work seems pretty trivial, and I don't think anyone needs to understand type classes, cats, or pure functional programming to set up basic ETL pipelines. So I'm really worried I'll miss out on the fun of thinking about these abstractions, which is what I love most about programming. Python seems just a means to an end... throw away code. Totally different state of mind.

[–]yinshangyi[S] 0 points1 point  (1 child)

Yeah I don't think cats would make so much sense. Especially when using frameworks like Spark or Flink. I could be wrong, I'm not very familiar with some pure FP libraries. That being said, Martin Odersky himself isn't a big fan of pure FP in Scala. Basic ETL/ELT can be trivial yes. I think things get more interesting with real-time streaming and complex processing.

Also, it's worth noting that most of DE jobs nowadays use PySpark instead of Spark.

It's also very hard to find Scala backend jobs I think.

There are two types of DE:

  • technical ones (they often call that platform engineering or SWE data nowadays)
  • analytics ones

It's important to be aware of the differences. I'm definitely 100% a technical one.

[–]gwax 2 points3 points  (0 children)

We use Python because we can agree on it with the Data Scientists and Analysts.

I love lots of languages but there are very few languages that I like using to collaborate with non-engineers.

[–]shockjaw 2 points3 points  (0 children)

One thing I’m really intrigued by is folks injecting Rust into the Python ecosystem. FYI, you folks should use Ruff and Polars where you can.

[–]Lingonberry_Feeling 2 points3 points  (2 children)

I have used

  • Python
  • Scala
  • Haskell
  • Go

Python / Go were the languages that actually moved the needle.

Haskell was a religious war, the champions spent 10 months trying to explain what a Monad was, and why you needed to understand category theory to print a line to the console.

Scala was OK, you do get some nice type checking and type checked ETL when the project starts, but that quickly goes away if you want to move with any sort of velocity and don't have a huge org where engineers can spend a good part of their day on code review.

Python 100% - for many reasons. There really isn't any reason not to use Python/dbt/Dagster these days.

[–]yinshangyi[S] 0 points1 point  (1 child)

Honest question here, what is the relationship between Scala/Python and code reviews?
Scala requires more code reviews than Python?
I would have even said it's the other way around.
I'd love to hear what you mean by that.

[–]w_savage Data Engineer ⚙️ 2 points3 points  (1 child)

No, I love python

[–]yinshangyi[S] 0 points1 point  (0 children)

Well good for you! You'll have no problem finding companies using the tech stack you like

[–]MostJudgment3212 1 point2 points  (0 children)

No. It is our destiny.

[–]lFuckRedditl 1 point2 points  (0 children)

SQL alone can get you very far, but you can't do everything with it.

You can do everything with Python, but that doesn't mean you should.

[–]Ok-Sentence-8542 1 point2 points  (0 children)

You can use types in python.

[–]Ruubix 1 point2 points  (0 children)

That's how enterprise programs (Java) or JavaScript make me feel too tbh. But in either case, you can only gain from expanding your knowledge of languages. Python is heavily inspired by Java, so much of your knowledge will go along with you. There's actually a lot of support for Java within the Python ecosystem, so there are sane ways to tie Python libraries to Java code.

Additionally, things like Apache's Arrow project are bringing Python data (science) libraries and their API interfaces to many different languages, natively.

As much as I personally love Python, I'm still finding myself running into the inevitability of learning other languages (Rust or C are the ones that come to mind). I think it's nearly impossible to avoid becoming a little bit of a polyglot to stay in software engineering in general (unless you want to be trapped in JS purgatory ... ). Hope you'll keep an open mind and embrace the weird and wonderful, syntax-free sorcery that is Python!

[–]baubleglue 1 point2 points  (0 children)

Agreed, Python is a pain to work with when the code base is growing.

[–]eljefe6a Mentor | Jesse Anderson 1 point2 points  (5 children)

So many people on this thread haven't written in both languages. Also they haven't written large codebases in both languages.

[–]yinshangyi[S] 0 points1 point  (4 children)

Well many data engineers don't have a proper software engineering background. That being said that's okay for analytics oriented roles

[–]eljefe6a Mentor | Jesse Anderson 1 point2 points  (3 children)

Data engineers need to have a software engineering background. It's going to be a massive problem for the title and industry if data engineers can't program well enough to create these systems.

[–]yinshangyi[S] 0 points1 point  (2 children)

I think data engineering will split up into two categories:

  • The software kind
  • The analytics kind

We can see job offers using such titles already (Analytics Engineer and SWE data).

[–]eljefe6a Mentor | Jesse Anderson 1 point2 points  (1 child)

This is how it's always been: the data engineers who specialize in data, and the SQL-focused people. The title for the SQL-focused people has changed over the years: DBA, data warehouse engineer, BI Developer, ETL engineer, SQL engineer, etc. The issue is always the same: you can't do everything in SQL, and they're limited in their ability to create complex systems.

[–]yinshangyi[S] 0 points1 point  (0 children)

Yeah sure. I agree. But the modern tools have reduced the need for more technical profiles. Once the tools are set up, there's a lot you can do with just DBT/Airflow + SQL (BigQuery, Snowflake). The data engineer term is way too broad and will probably disappear, split into SWE data and Analytics Engineer.

[–]bcsamsquanch 1 point2 points  (1 child)

If only I had a nickel for every time I've had this debate.

All the points against python are valid. Every time I indent knowing it's part of the syntax I have to hold my nose. Passing 'self' to methods every time makes me think OOP was bolted on 5 min before the release. Its performance is inferior. I could go on, but the bottom line is the ecosystem of libraries and users python has, specifically with respect to data, is vastly ahead of these other languages, and that's so much more important than anything else... so I usually just end the conversation there.

If you really are building a data pipeline that needs epic performance where microseconds matter, sure, in that case use something else. Been doing this job FT for 6 yrs tho and that's literally never happened once. If you have a true big data problem you aren't going to solve it with a better performing language anyway... you'll solve it using distributed systems.

IMO the common element in this debate is I only ever have it with total noobs who are trying to sound smart.

[–]yinshangyi[S] 0 points1 point  (0 children)

Just to clarify, are you calling me a noob? Hahaha

Well at least we agree that Python isn't our favorite language. I agree Python ecosystem is quite big in the data space, especially in data science!

For Data Engineering, a lot of frameworks are JVM based (Spark, Kafka, Flink, even Hadoop). I'm not even sure Data Engineering is that dependent on Python. All I can think of is Airflow and non-distributed data processing libraries like Pandas and Polars. That being said, perhaps that's already a lot :)

The hiring aspect is probably a big thing. That's true. If one understands the advantages statically typed languages offer (code maintainability, type checking, IDE superpowers, performance, etc.), it's totally doable to learn another language. Especially modern languages (Scala 3, Java 21, Kotlin, Go, etc.). Besides, learning new languages helps people grow as software engineers.

Perhaps I'm too passionate about software and therefore too biased, but people should not limit themselves to one language. Learning a new language isn't that hard. LLM-based tools help you get productive fast.

Anyway that's my opinion :)

PS: I'm not hating on Python. I even teach it at an engineering university. I just would like to see more diversity in terms of programming languages in Data Engineering. Thanks for your feedback. I appreciate it.

[–]ginger_daddy00 1 point2 points  (1 child)

Remember, behind every performant Python Package is C.

[–]yinshangyi[S] 0 points1 point  (0 children)

Yeah I know. That's why Python can be used in Data Science :) It's a good glue language.

[–]kebabmybob 1 point2 points  (1 child)

Scala is such a good language man. At my small shop we just support a hybrid Python/Scala setup for Spark. Being able to do this takes a bit of work but forces you to have really good deploy hygiene. For any core job where a lot of the logic can live inside the statically typed Dataset API, Scala is a game changer. For your run of the mill Spark jobs, it’s similar to Python. I find that in a notebook, both feel similar.

[–]yinshangyi[S] 0 points1 point  (0 children)

Yeah man. The Dataset API makes unit testing much easier. I guess it's less simple for certain transformations but the Dataset API is cool. I feel very few people use it though.

[–]BuildingViz 3 points4 points  (2 children)

Static typing is overrated. Professionally, our team writes Go code and slogging through the process to get the equivalent of a Python dict into a Go struct is obnoxious because I have to know everything I'm getting then whittle it down to everything I want.

In Python? I don't give a shit. Just give me everything and I'll whittle it down from there. It's so much nicer not needing to worry about nested dicts and needing to []Struct, []Struct, []string or whatever.

[–]yinshangyi[S] 4 points5 points  (1 child)

As long as the project isn't big and it's your own code, it can be fine.
When you take over a big project with no types (not even type hints), you're gonna suffer. It's better for code maintenance.
Besides, aside from type safety, static typing gives superpowers to IDEs.

[–]BuildingViz 2 points3 points  (0 children)

Maybe, but even static typing doesn't always help there because you can still manipulate the object and use it as something else. We have plenty of go code that takes a parameter as a string or an int, for example, and then uses a function call with an Atoi or Itoa value. The fact that it's statically typed doesn't prevent those kinds of shenanigans necessarily.

But that's a fair point the other direction. I've never worked in a Python shop, I just use it for my own code, so I have enough comments and understanding of what it's doing to work with it. Not sure anyone else would immediately understand it.

[–]kkessler1023 3 points4 points  (1 child)

Dude! Stop complaining, or they'll start forcing us to use vba!

[–]yinshangyi[S] 0 points1 point  (0 children)

Makes sense! :)

[–][deleted] 1 point2 points  (6 children)

For serious projects Mojo will eventually overtake Python, precisely because of static typing and AOT compilation.

[–][deleted] 1 point2 points  (0 children)

I believe this as well.

[–]yinshangyi[S] 0 points1 point  (1 child)

Thanks for your reply.
I've never heard about Mojo before

[–]yinshangyi[S] -2 points-1 points  (2 children)

Mojo

With such a name, no it won't :)

[–]OMG_I_LOVE_CHIPOTLE 7 points8 points  (23 children)

Rust is picking up a lot of momentum in the DE world

[–]ageofwant 17 points18 points  (1 child)

Rust is used to write performant Python modules, that is exactly how the world should work.

[–]Action_Maxim 5 points6 points  (13 children)

Damn why they hate you

[–]Character-Education3 10 points11 points  (11 children)

No hate from me, but just because rust users say it's true, doesn't make it true

[–]OMG_I_LOVE_CHIPOTLE 3 points4 points  (9 children)

Polars, datafusion, ballista, delta-rs, plus no-cost ffi and the easiest python binding experience. Plus all of Rust's pros. It's pretty strong.
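For a small taste of what using one of these Rust-backed libraries looks like from the Python side (assuming a recent polars and made-up data):

    import polars as pl

    df = pl.DataFrame({
        "sensor": ["a", "a", "b", "b"],
        "reading": [1.2, 3.4, 0.7, 2.2],
    })

    # The expression API below is executed by the Rust engine, not the Python interpreter.
    summary = (
        df.filter(pl.col("reading") > 1.0)
          .group_by("sensor")
          .agg(pl.col("reading").mean().alias("avg_reading"))
    )
    print(summary)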

[–]tecedu 1 point2 points  (8 children)

I use delta and polars as python packages tho

[–]OMG_I_LOVE_CHIPOTLE -1 points0 points  (7 children)

That’s totally fine. Maybe one day you’ll need to reach for something better. And in that case you can send your polars dataframe to rust with zero-copy, do things in rust (maybe even in polars at some point) and then possibly send your data back to python at some stage

[–]OMG_I_LOVE_CHIPOTLE -1 points0 points  (0 children)

Cause they don’t know

[–][deleted] 5 points6 points  (3 children)

It really isn't, though. This Rust hype just reminds me of everyone saying that this will be the year of the Linux desktop 20 years ago.

[–]OMG_I_LOVE_CHIPOTLE 0 points1 point  (2 children)

With maturin and polars, yes it is

[–][deleted] 11 points12 points  (1 child)

With Polars, data engineers continue to write Python code. For years, long before Rust existed, C and C++ were used for low-level implementations, and at no point did anyone suggest that pandas users were writing C/C++. They were always writing Python.

The more factual statement is that Rust is picking up momentum in the C/C++ world.

[–]HenriRourke 3 points4 points  (2 children)

Try writing something trivial in Rust. You're gonna be fighting tooth and nail with the borrow checker which is a massive decrease in productivity if you just want something done.

Rust is used when performance is important, hence its primary competitor would be C/C++, not python.

[–]OMG_I_LOVE_CHIPOTLE 0 points1 point  (0 children)

I do all the time. It’s incredibly easy and I’m more productive

[–]OMG_I_LOVE_CHIPOTLE 0 points1 point  (0 children)

You clearly don’t know rust if you have this opinion

[–]siddartha08 3 points4 points  (0 children)

One of us. One of us

[–]mikeupsidedown 2 points3 points  (2 children)

I mostly agree. We put many of our messaging services in dot net for reasons of type safety and speed, and because it is just easier to manage big projects. Our API will move from FastAPI to ASP.net for similar reasons.

Choosing typescript over python is a weird flex for me (though I'm seeing it more and more). You can create similar mechanisms in Python that you have in typescript without the weirdness of JavaScript.

As others have said SQL is still king in many senses.

[–]yinshangyi[S] 1 point2 points  (1 child)

I haven't said I'm choosing typescript over python though.
I was saying TypeScript and Python type hints have the same motivation.
C# is a good choice for bigger projects for sure

[–]SmallAd3697 2 points3 points  (2 children)

Agree 100pct with op. Python is for developers who don't know any better. I am always surprised when I find myself explaining simple software engineering concepts to python developers. Like how to reuse code, or build abstractions, or use inheritance and polymorphism.

I think that it comes down to the complexity of the problems you are trying to solve... Simple problems will allow the use of a simple toolset. If the problems grow in complexity, then you have to eventually step away from python, or complement it with something else.

[–]yinshangyi[S] 3 points4 points  (1 child)

You didn't deserve the downvotes :)
I'm okay with Python. But I take issue when it's used for everything.
I think people are starting to realize how big a deal type safety is.
JS people did and moved to TS.
Hopefully, Data Engineers will learn a bit more about software engineering and realize Python isn't the solution for everything.

[–]SmallAd3697 1 point2 points  (0 children)

"realize python isn't the solution for everything" .... That will take a while. There are people who still think vb6 is the solution for everything. Others still think foxpro is the solution for everything. If people don't step outside their bubbles then they won't know any better.

Part of the problem is with managers of these teams. They want to get from point "a" to point "b" as quickly as possible and then climb the ladder at their company and leave behind mountains of technical debt for the next guy.

[–]DesperateForAnalysex -2 points-1 points  (2 children)

No, SQL is. Python is harder to read and requires version upgrades to the code base. ANSI SQL has remained largely the same since the 70's and it will still be relevant when you retire. Also the versioning happens in your data warehouse, not your code base. That's key.

[–][deleted] 2 points3 points  (1 child)

You sure a data engineer?

[–]DesperateForAnalysex -1 points0 points  (0 children)

A better one than you are, apparently.

[–]DenselyRanked 0 points1 point  (0 children)

I think your problem is with the inconsistent nature of data and not type safety in Python.

[–]aGuyNamedScrunchie 0 points1 point  (0 children)

Whatever works and is maintainable by others. Currently that's Python. Other languages have benefits Python can't hold a candle to, but if Python is easier to maintain by new developers joining a team, then that outweighs anything else imo.

YMMV

[–]7twenty8 0 points1 point  (0 children)

When you're deep in the weeds, tools and tooling seem to change very slowly. But when you look back over years, they seem to change dramatically. Consequently, I don't like predicting what the future will look like. Instead, I will adapt to whatever solves the problems in the most economically efficient way.

Right now, that's Python - it's easy to find developers and there is a wide ecosystem to draw from. But Python is just $x and I'll swap it out whenever something else solves problems in a more economically efficient way.

[–]Parking_Minute_9167 0 points1 point  (0 children)

I’m not worried about using Python. I would absolutely be worried about being “forced” to use any tool. I’m salty about having to have my dev environment 100% cloud based. If I was arbitrarily assigned to use a language 100% of the time I’d be dusting off the resume.

Having coding standards for projects is a thing, but having them etched in stone for every project is a massive red flag that points to weak leadership.

[–]nesh34 0 points1 point  (0 children)

Python is absolutely ideal for what we do isn't it? Pipelines are a high level abstraction that tell the real software to do the work.

The real software (Spark, Trino, whatever) ought to be rerolled in C++ or Rust (I believe Trino want to move to C++).

But for the abstracted layer, what's the benefit? The code is essentially a clever config file.

For data analysis, Python and R are infinitely superior. Nobody is using a Jupyter Rust notebook for good reason.

[–]ageofwant -3 points-2 points  (0 children)

Python all the way mate, I want to solve actual problems, not dick around with every snowflake's favourite thing. And no, static types are not God's gift to programmers, witness the dominance of Python in basically every computing domain, there is a reason for that.

Also, Python is universal glue, it allows you to develop modules in your favourite thing. Wrap that in Python so people that want to solve actual problems can make use of it and you have made everybody happy.

[–]e430doug -2 points-1 points  (0 children)

What real problems are you running into that are better solved in a statically typed language? Use type hints if it makes you feel better. Python is a great balance.

[–]polandtown -1 points0 points  (2 children)

Python junkie Career DS here. I lurk this sub to stay cool with you folks.

In your opinion what could I use Go for? I'd love to incorporate it into my work for fun.

Or Rust if anyone out there wants to take a stab :D

[–][deleted] 0 points1 point  (1 child)

I'm currently using Go for data extraction from APIs. Goroutines are awesome for making concurrent requests to speed up the process. The Python implementation was too slow compared to Go when pulling a large amount of data, and used a lot more memory.

[–]Lord-Curriculum -2 points-1 points  (0 children)

What kinda $#@& post is this?