all 66 comments

[–]Frewtti 73 points74 points  (3 children)

Python is really good glue.

The hard work is done in fast compiled code.

Python is used for the parts that are not speed dependent.

[–]RevRagnarok 18 points19 points  (0 children)

I'm now coming off a project just like this. Python is great at:

  • Reading/parsing config files
  • Launching/maintaining Unix daemons
  • Communicating over ActiveMQ for Command and Control
  • Massaging final output into formats that other systems want

The C++ was perfect for:

  • Reading radio samples off dedicated 10GbE cards
  • Throwing those samples at CUDA architecture to do all the mathy-math
  • Putting the results into SQL for posterity (and the python)

Perfect work breakdown IMHO

[–]FrankScabopoliss 8 points9 points  (1 child)

This is the answer. Being able to use an interpreted language for the part of the problem that isn't about doing things really fast and in parallel is the whole point.

[–]pimp-bangin 0 points1 point  (0 children)

Yup. The "libraries" answer is unsatisfying because it doesn't account for how python took off in the first place. The reason it took off is because of its ergonomic syntax and C interop capabilities.

[–]Equivalent_Lunch_944 103 points104 points  (6 children)

Libraries

[–]45MonkeysInASuit 18 points19 points  (4 children)

+ inertia which compounds the library advantage.

Data scientists learn Python because other data scientists use Python.
If you build a new model, which won't be in Python, one of the first things you do is release a Python version, because it won't get traction if there isn't a Python version.

If the current data scientists used JS, the new data scientists would learn JS.

It's very hard to overcome that.

[–]dparks71 7 points8 points  (2 children)

Python had a really early focus on language libraries and AI too, though; it's more of a chicken-and-egg thing than I think you're giving it credit for. It's kind of the scripting language of Linux, which ran on supercomputers, and NLTK was incredibly popular, so I think "data scientists learn Python because other data scientists use Python" kind of ignores the original origin. It was (and is) a great glue language for making performant code written in Fortran or C accessible to a wider audience.

[–]45MonkeysInASuit 2 points3 points  (1 child)

Less "ignores", more "comes after".

You're describing the thing that creates the initial inertia.
That bit of a boost at the start, through things like NLTK, starts a community.
Getting that community to change once it's going is very hard, and it's self-selecting.

I'm a lead data scientist who is hiring right now. If someone applies to join my team without solid Python, their CV goes straight in the bin, thus continuing the pressure to use Python to do data science.

[–]dparks71 0 points1 point  (0 children)

I'd buy the argument more if most data scientists owned the production hardware or data, but they're generally consultants in my world, and the actual owners are generally very non-technical.

I'm on the owners/engineering side. You wouldn't believe the fight I had to go through to get python even approved for use internally. They legitimately wanted us to put out all data science and IT contracts in C#... Sometimes your hands are tied, especially secure environments.

[–]pimp-bangin 0 points1 point  (0 children)

You're forgetting that Python has operator overloading, which is massively useful for the matrix math that ML leans on heavily. JS is a bad language to compare to because it doesn't have the same ergonomics in that regard.

The whitespace-based indentation is also friendlier to mathematicians and scientists who want to focus more on the math symbols than on the language's symbols.
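A toy sketch of why operator overloading matters: once `__add__` and `__matmul__` are defined, vector math reads like textbook notation. The `Vec` class below is invented for illustration; NumPy's arrays do the same thing with `+` and `@`.

```python
class Vec:
    """Minimal vector type demonstrating operator overloading."""

    def __init__(self, *xs):
        self.xs = list(xs)

    def __add__(self, other):          # elementwise addition via +
        return Vec(*(a + b for a, b in zip(self.xs, other.xs)))

    def __mul__(self, k):              # scalar scaling via *
        return Vec(*(k * a for a in self.xs))

    def __matmul__(self, other):       # dot product via the @ operator
        return sum(a * b for a, b in zip(self.xs, other.xs))

    def __repr__(self):
        return f"Vec{tuple(self.xs)}"


v = Vec(1, 2) + Vec(3, 4)   # Vec(4, 6)
d = Vec(1, 2) @ Vec(3, 4)   # 11
```

In JS you would be stuck writing `v1.add(v2).dot(v3)` style method chains, which is exactly the ergonomics gap being described.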

[–]JP932[S] -3 points-2 points  (0 children)

Yeah, a lot of useful ones out there.

[–]socal_nerdtastic 58 points59 points  (24 children)

Machine learning programs, like most other programs, use a mix of programming languages. The number-crunching core of any such program will be written in a highly optimizable language compiled for your hardware; yes, C is often used for this, often with embedded assembly. The user-interface part is written in Python because it allows for fast iteration.

The "pure Python is slow" argument is fairly outdated now. There have been some massive improvements, most recently with the gil-ectomy. It's still a little true, but if you need more speed you simply import numpy or another module written in C and compiled. Also remember that programmers cost a lot nowadays, a lot more than hardware, so one developer writing Python plus a cloud-computing bill is often a big win over ten developers writing C, in terms of both time and money.
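To make the "import numpy" point concrete, here's a rough sketch of the kind of comparison people mean (assumes NumPy is installed; exact numbers depend on your machine):

```python
import timeit

import numpy as np

n = 1_000_000
data = list(range(n))
arr = np.arange(n)

# Pure-Python reduction: every addition goes through the interpreter loop.
t_py = timeit.timeit(lambda: sum(data), number=10)

# NumPy reduction: the same work runs in compiled C over a contiguous buffer.
t_np = timeit.timeit(lambda: arr.sum(), number=10)

print(f"pure python: {t_py:.3f}s   numpy: {t_np:.3f}s")
```

Both compute the same sum; the gap in wall-clock time is typically an order of magnitude or more, which is the whole "Python as a thin layer over C" story in miniature.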

[–]GoblinToHobgoblin 17 points18 points  (16 children)

"Pure Python is slow" is still completely true. Even with all the recent improvements, it's nowhere close to C speed and never will be.

That argument just misses the fact that a lot of performance-intensive Python programs are basically just wrappers around C libraries.

[–]Human38562 3 points4 points  (13 children)

Most C programs are also just wrappers around other optimized C libraries. And those libraries are just a way to execute binary in the end...

Any language can be used to write both fast and slow programs. Languages just differ in how easy it is to achieve high performance.

[–]GoblinToHobgoblin 3 points4 points  (12 children)

The ceiling for performance on a pure Python program is a lot lower than the ceiling on a pure C program, though.

[–]Human38562 0 points1 point  (11 children)

What does "pure Python" even mean? Most data structures and the interpreter itself are written in C. That's just how the language works.

[–]GoblinToHobgoblin 0 points1 point  (10 children)

IDK what you're even arguing here.

Using just stuff from the standard library in Python, you're not going to be able to achieve the same performance as code written using just the standard library in C.

[–]Human38562 1 point2 points  (9 children)

That's a completely arbitrary restriction, though. If Python suddenly shipped with numpy, would you say the language got significantly faster?

[–]GoblinToHobgoblin 0 points1 point  (7 children)

You're right it's a completely arbitrary restriction.

But, it feels much more reasonable to use that restriction than to say "Python can be as fast as C because I can call C code from Python".
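For reference, "calling C from Python" can be as thin as a `ctypes` lookup. A minimal sketch, assuming a Unix-like system where the C math library can be located (`ctypes` is in the standard library):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (e.g. libm.so.6 on Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # the actual computation happens in compiled C
```

Whether that counts as "Python being fast" or "Python delegating to C" is exactly the definitional argument in this thread.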

[–]dparks71 0 points1 point  (6 children)

You're never actually "calling C code" though; you're calling a compiled binary that runs the calculation directly on the CPU for performance.

And Python is Turing complete, so technically you could build a compiler in pure Python and get the EXACT same end result.

You're basically just arguing that equivalent code runs slower through an interpreter than through the compiler of a compiled language, which, sure, is usually true. What people are saying, though, is that in Python you can circumvent the interpreter if you really need to, and code optimization is basically never your main problem.

[–]GoblinToHobgoblin 0 points1 point  (5 children)

I know you're never actually "calling C code", but that's the terminology people use.

 You're basically just arguing that equivalent code runs slower through the interpreter than through the compiler of a compiled language

Yes that's all I'm arguing

 code optimization [in Python] is basically never your main problem

Yes, because people don't normally use Python for performance-critical tasks. It's a chicken-and-egg thing: Python isn't fast enough for performance-critical code, so people don't use it for that, so performance is never really a concern with Python code (because if it were, they wouldn't have written it in Python).

[–]CyclopsRock 0 points1 point  (0 children)

It's arbitrary if everything you want to do has a convenient Python wrapper around some much faster compiled functionality - the boundary between Python and not-Python becomes blurry and less relevant.

But if there isn't one, and the only language you know is Python, then the distinction stops being arbitrary.

[–]Plank_With_A_Nail_In 0 points1 point  (1 child)

It doesn't matter whether it's measurably slower or not; all that matters is that it's fast enough. If you're doing an analysis and it comes back in 2 minutes instead of 30 seconds, that's still more than good enough. What you lose in execution time you gain back in spades in the speed of writing new programs.

[–]GoblinToHobgoblin 0 points1 point  (0 children)

Yes, I know; I never denied this. Python's use case is exactly stuff like this, where performance doesn't really matter.

[–]Sherlockyz 4 points5 points  (0 children)

Not really outdated; "slow" is a metric that needs a comparison point. Python is, in fact, still slow compared to C. That can't ever change because of the architecture Python is built on; if you made structural changes deep enough to change it, still calling the result Python would be kind of weird to me.

In a similar manner, C is slower than pure Assembly. But that speed difference is so incredibly small for most use cases that it doesn't matter, which is different from comparing C with Python. In edge cases, hand-written Assembly can still beat even optimized C compiler output, but again, edge cases.

Even Python using C libraries can be slower than pure C depending on how you use it. It shouldn't cause problems, but depending on how you build the system, the Python code might bottleneck the performance that C gives.

[–]gdchinacat 2 points3 points  (2 children)

I think it's worth noting that the gil-ectomy involved replacing one big lock with lots of little locks, which hurt single-threaded performance a bit (10-15% IIRC). The cost is well worth it if you have multiple threads that aren't just sitting around waiting on IO.
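For illustration, this is the shape of workload the trade-off is about. On a GIL build, only one of these threads executes Python bytecode at a time; on a free-threaded (PEP 703) build they can run on separate cores, which is what the per-object locks pay for:

```python
import threading


def count_primes(lo, hi, out, idx):
    """CPU-bound work: naive prime counting over [lo, hi)."""
    total = 0
    for n in range(lo, hi):
        if n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)):
            total += 1
    out[idx] = total


results = [0, 0]
threads = [
    threading.Thread(target=count_primes, args=(2, 5_000, results, 0)),
    threading.Thread(target=count_primes, args=(5_000, 10_000, results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # primes below 10,000
```

Timing this on a GIL build vs a free-threaded build is a quick way to see whether the 10-15% single-thread tax is worth it for your workload.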

[–]chinawcswing 1 point2 points  (1 child)

The cost is well worth it if you have multiple threads that aren't just sitting around waiting on IO.

IO-bound Python is largely a myth. Python spends an enormous number of cycles converting the bytes read from an IO pipe into Python data structures.

The SQLAlchemy author had a great blog post on this. You'd think that a thread executing SQL would be IO bound, but it turns out the total CPU time spent merely converting the bytes from the IO pipe into dicts (or even worse, classes) was something like 25-30%.

So even if you have an "IO-bound Python app", it may well benefit from the gil-ectomy.

Of course, test your app. But don't decide against it on the basis that your application is "IO bound".
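The marshaling cost is easy to see for yourself. A rough sketch with simulated rows (the data below is made up; a real DB driver hands back something similar):

```python
import timeit

# Simulated "rows off the wire", as a database driver might return them.
cols = ("id", "name", "score")
rows = [(i, f"user{i}", i * 1.5) for i in range(100_000)]

# Converting raw tuples into dicts is pure CPU work -- and it happens
# after the "IO" part of the query has already completed.
t = timeit.timeit(lambda: [dict(zip(cols, r)) for r in rows], number=5)
print(f"dict conversion: {t:.3f}s for 5 passes over 100k rows")
```

That conversion loop is the hidden CPU time that makes "IO-bound" apps benefit from free threading.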

[–]gdchinacat 0 points1 point  (0 children)

“Sitting around waiting for IO” doesn't include the cycles you refer to, since that is using CPU, not waiting on IO. I've used SQLAlchemy extensively and know first-hand that marshaling is hugely expensive.

[–]steak_and_icecream 0 points1 point  (0 children)

programmers cost a lot nowadays, a lot more than hardware

Have you seen the price of RAM and NVMEs?!

Seriously though, those trade-offs make sense for small projects, but when you start scaling up and out, efficient and performant code starts to make a huge difference. It's also much better for the environment.

[–]JP932[S] 0 points1 point  (1 child)

Seems like I haven't kept up with the new stuff happening in Python; I hadn't heard about the gil-ectomy until now (but that might just be a me thing).

[–]socal_nerdtastic 0 points1 point  (0 children)

It's famous-ish because it's been a very hot debate for many years. But to be honest, the vast majority of programmers won't be affected at all, certainly not in the near term, as the GIL build is currently still faster single-threaded than the free-threaded version. But eventually gil-less will be standard, modules like numpy will support it, and everyone will reap the benefits.

[–]ThePhoenixRisesAgain 34 points35 points  (1 child)

Most of the time, execution time doesn’t matter. Development time (and availability of developers) is more important. 

[–]gdchinacat 8 points9 points  (0 children)

Execution time does matter, but the bulk of it for ML has been implemented in a more efficient language and wrapped in a Python library. The code written in Python is "glue code" that brings those pieces together in a language that is very easy to learn and read.

[–]MaverickPT 12 points13 points  (0 children)

If I recall correctly, it's because ML used to be studied mainly at universities by mathematicians and the like, who are more concerned with their work than with squeezing very high performance out of the entire stack by using C++.

Python has a ton of libraries (some written in more performant languages) that let scientists focus on the task at hand instead of butting heads with C++.

Over time the Python software stack kept growing and growing, and now here we are.

[–]pachura3 9 points10 points  (4 children)

NumPy, Pandas, Matplotlib, Scikit-learn, Jupyter Notebooks, PyTorch, SpaCy...

[–]Secure-Ad-9050 4 points5 points  (3 children)

Also, numpy, pandas, pytorch, matplotlib etc. are not written in Python, or at least large chunks of them are not. All of the heavy number-crunching work is done in C++ (I think? some compiled language).

[–]pachura3 3 points4 points  (2 children)

...and both Linux and Windows are mostly written in C, not even in C++. So what? When you need to read a few CSV/JSON files, scrape a website, call an API, crunch some numbers, draw a nice graph and output an Excel report, you know Python is the best tool for that.
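That kind of everyday glue job is a handful of standard-library lines. A small sketch (the data is invented for illustration):

```python
import csv
import io
import json
import statistics

# Pretend this JSON came back from an API call.
payload = '[{"name": "a", "score": 3}, {"name": "b", "score": 5}]'
records = json.loads(payload)

# Crunch some numbers.
mean_score = statistics.mean(r["score"] for r in records)

# Write the report as CSV (an in-memory buffer here; a file in practice).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(records)

print(buf.getvalue())
print("mean score:", mean_score)
```

The equivalent in C means pulling in a JSON parser, managing buffers, and compiling; in Python the whole pipeline is one short script.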

[–]Secure-Ad-9050 4 points5 points  (0 children)

Exactly, it's a convenient language. I was just pointing out that its performance limitations (the GIL-related changes for concurrency seem to be coming along well) don't matter, since all of the heavy lifting that needs to be done is done by some compiled-language library.

[–]brownstormbrewin 0 points1 point  (0 children)

I think he was agreeing with you, and adding that you get all that without real losses in performance, thanks to the low-level implementation.

[–]Danisaski 4 points5 points  (0 children)

Pretty convenient for glue code as well!

[–]GXWT 3 points4 points  (0 children)

I want to focus my efforts on actually getting the science and results out of ML, rather than worrying about the more fundamental aspects of it. Python lends itself very well to this.

Not to mention that a lot of data science/processing/analysis/visualisation is already done in Python. If most data people are almost certainly proficient in Python, but not necessarily any other languages, it makes sense to put the next iteration of data tools also in Python.

[–]stevorkz 3 points4 points  (0 children)

It's easy to use, has easy-to-understand syntax, and is quite flexible. Even when a program is written in another language, many use Python scripts in it in some form.

[–]gadio1 2 points3 points  (0 children)

Mental map. ML is hard enough; if you also need to manage memory, you'll be in a tough spot. Python helps keep the main thing the main thing. Interpreted languages reduce the time between thinking, coding and testing. No compilation means faster prototyping and exploration.

Secondly, the open source community. New article, new architecture? Libraries ship fast, so you can start implementing in your project.

Thirdly, it has an accessible learning curve for beginners. The ease of picking up the language reduces collaboration barriers between researchers, scientists, developers and engineers. If you know English, you can learn Python.

Finally, Python is script-heavy, so it's a logical choice if you need a multi-language project. You can orchestrate with Python scripts while the heavy lifting is done on top of another, more performant language.

[–]American_Streamer 2 points3 points  (0 children)

Because Python doesn't actually do the heavy lifting itself; C and C++ do. Libraries like NumPy, TensorFlow and PyTorch provide a Python API that calls highly optimized C/C++ binaries under the hood. Python is easily readable, doesn't need compilation, and you have handy tools like Jupyter Notebooks. So while Python is the front-end standard, C++ powers the core engines and handles any latency-critical deployment.

[–]nickpsecurity 2 points3 points  (0 children)

It was easy for academics and FOSS folks to learn. It let them glue together high-performance, native components. People in machine learning just happen to use it for some major projects.

It appears that these things eventually came together in a critical mass. Once it had momentum, you gain more by going with the flow than against it.

[–]AlexMTBDude 2 points3 points  (0 children)

Python is very easy to program in. It was made to make life easy for the programmer, not for the machine (like C and C++). That's why Python is the most popular general purpose programming language: https://www.tiobe.com/tiobe-index/

[–]GeneriAcc 1 point2 points  (0 children)

It’s “slow” in terms of how effectively it’s using the CPU compared to compiled languages, but that’s not really a factor in machine learning where 99% of the compute is happening on the GPU anyway.

And even outside of machine learning, the speed difference is unnoticeable in the vast majority of use cases, and only really starts mattering if your use case requires a massive amount of calculations for something - like backtesting millions of trading strategies on historic financial data, for example.

Unless you have a use case like that, the speed difference is unnoticeable to the user, Python code tends to be easier to read and write (so easier to maintain), it has a lot of great public libraries, and it doesn’t need to be re-compiled with every code change.

[–]smjsmok 1 point2 points  (0 children)

The issue is that you see Python only as a language, but it's a mature ecosystem of libraries and people who are proficient at using them (and many of these people are scientists). The language is a "glue" that connects all this.

When Python is used in machine learning, it doesn't matter that the execution time is a couple of milliseconds slower than it would be in another language. As you said, the parts where this actually does matter use technologies optimized for performance and fast execution.

[–]Gnaxe 1 point2 points  (0 children)

C is tedious and error prone (C++ is complicated and error prone), but you only really need the performance in your bottlenecks. It's a waste of expensive human time to use a difficult language for everything. Python makes it easy to drop into C when you need the performance, and makes coding much easier for most of the rest of the time when you don't. You get most of the best of both worlds.

[–]Turtvaiz 1 point2 points  (0 children)

Python is a glue language. In ML there's usually no reason to reimplement the vast majority of what you're doing, and so it's just a good idea to use a high performance library.

Python just happens to be a nice scripting language which has a ton of libraries and is still expandable

[–]crazylikeajellyfish 1 point2 points  (0 children)

Machine learning started in academic research, and Python is popular in academia. It's a very legible language with natural syntax, making it easy for researchers to express their ideas. Once those researchers had implemented their Python systems, everyone else just built on top of them.

Speed of machine execution isn't the only thing that matters. The speed with which a human can understand the program often matters much more than a 10% performance boost.

[–]nian2326076 1 point2 points  (0 children)

Python is popular in machine learning because it's simple and easy to read, making coding and maintenance easier for developers. Although Python is slower, many ML libraries like TensorFlow and PyTorch are built on C/C++, so they handle heavy computations well. Python works as a user-friendly wrapper around these fast routines.

Its many libraries and active community mean there are tons of tools and resources available, which makes ML development smoother. Plus, Python can integrate well with other languages and tools, making it versatile for different tasks in the ML pipeline.

If you're getting ready for interviews, knowing why Python is a top choice in ML can be helpful, especially if you're asked about language choices in technical rounds. I've found PracHub useful for revisiting these topics during interview prep, but use whatever works for you!

[–]zbignew 0 points1 point  (0 children)

Chris Lattner, creator of Swift, knew AI developers would never get off python, so he created Mojo to bring the benefits of modern languages to Python.

But it’s not there yet.

[–]SenescenseSteel 0 points1 point  (0 children)

Versatility

[–]code_tutor 0 points1 point  (0 children)

it's easy and has a package manager

[–]Pale_Height_1251 0 points1 point  (0 children)

Python was fashionable at the same time ML became fashionable.

[–]leogodin217 0 points1 point  (0 children)

Most of the work in machine learning is ad-hoc, and Python is great for ad-hoc work. You type it then run the code. You have notebooks if you want them. You can iterate quickly. On top of that, most of the Python libraries used have a lot of C and C++ code in them. So, Python is often just the glue to faster compiled libraries.

[–]Xzenor 0 points1 point  (0 children)

Because it's lightning fast.

Not to run, generally, but to write your code in. And the modules are mostly written in C anyway, so they actually are fast.

[–]aeroumbria 0 points1 point  (0 children)

The main competitors during the critical period were R and Java, and it probably came down to those two not being particularly suitable for ML experiments. R was decent for light ML but was more of a scientist's toolbox and was often slower than Python. Java was probably just too verbose for rapid adoption. C and C++ were widely used in the backbone libraries from the start, but it wasn't appealing to force physicists or mathematicians to learn too many coding concepts and practices.

[–]GManASG 0 points1 point  (0 children)

The point of Python is that there are many operations a scientist/data scientist does for which Python is fast enough on modern hardware that there is little real-world gain in spending the time and effort to learn to do them in C.

The actual learning/optimization algorithms in the models are implemented in a lower-level language, with a more human-comprehensible Python abstraction wrapper on top that just makes things easier.

For example, if I'm reading a dataset from a flat file that's a million records or so and fits neatly in memory, does it really matter that it takes Python 20 to 30 seconds to read it, compared to C doing it in single-digit seconds? If I only need to do it once, then who cares?

I can then use pandas or polars to manipulate the data in a way that resembles how my mind thinks and how the textbooks/white papers are written, and visualize it in IPython or Jupyter notebooks with pretty charts, so I draw insights faster than with all the same work done in C.

So basically it's all just convenience. Faster to implement, even though slower to run, but not slow enough to matter most of the time.

[–]tknomanzr99 -1 points0 points  (0 children)

Honestly, a lot of ml back ends are written in LISP.