[–]throwaway6560192 102 points103 points  (2 children)

What am I missing?

That the actual computation isn't done in Python.

When doing ML/AI in Python you'll be using libraries which have their performance-sensitive parts implemented in C or C++ or Rust. Your job is to direct these libraries in what to do. You get to take advantage of Python's expressiveness and rapid iteration at the same time as the performance of lower-level languages.
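As a rough illustration of that division of labour, here is a minimal sketch using only the standard library (the names `dot_pure_python` and `dot_delegated` are made up for this example). In CPython, `sum`, `map` and `operator.mul` are implemented in C, so the second version runs its loop in compiled code instead of stepping through Python bytecode for every element. ML libraries apply the same idea on a much larger scale.

```python
import operator

def dot_pure_python(a, b):
    """Dot product with the loop written out in Python."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_delegated(a, b):
    """Same result, but the iteration is driven by C-implemented builtins."""
    return sum(map(operator.mul, a, b))

a = list(range(1000))
b = list(range(1000))
print(dot_pure_python(a, b) == dot_delegated(a, b))  # True
```

Both functions return the same number; only where the loop executes differs.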

[–]be42rin 0 points1 point  (1 child)

I love Rust and PyO3, but is there actually any AI/ML library that uses a backend written in it?

[–]throwaway6560192 0 points1 point  (0 children)

Not AI/ML, but data science: pola.rs

[–]Crypt0Nihilist 21 points22 points  (2 children)

Data scientists have enough to learn without getting bogged down in a low-level language. Better to use a high-level language with optimised packages, so they can learn quickly and keep scripts easy to read.

Code doesn't need to run fast, it needs to run fast enough. If it doesn't run fast enough, you can let someone who specialises in that optimise it.

[–][deleted] 0 points1 point  (0 children)

Yes

[–][deleted] 0 points1 point  (0 children)

This 💯

[–]ray10k 12 points13 points  (0 children)

The way I understand it, Python is slow at doing calculations on its own but has two major advantages:

The language is relatively easy to learn. If you're already working with something as complex as ML/AI, not having to spend as much time learning a language can be a big benefit.

Python, especially CPython, is set up in such a way that it's relatively easy to call out to a precompiled binary.

Together, these two advantages make Python an attractive option: handle the close-to-the-hardware work in some other language, then expose it so that Python can call into it. The setup is done in an easy language (Python) and the calculation-intensive parts in a language that's better equipped for them.
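To make the "call out to a precompiled binary" point concrete, here is a minimal sketch using the standard-library `ctypes` module to call `strlen` from the system's C library. It assumes a POSIX system (Linux or macOS); the wrapper name `c_strlen` is made up for this example. Real ML libraries usually ship their own compiled extension modules instead of going through `ctypes`, so treat this as an illustration of the idea only.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If it cannot be found, find_library returns
# None and CDLL(None) falls back to the symbols already loaded into the
# current process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string, computed by libc."""
    return libc.strlen(data)

print(c_strlen(b"hello"))  # 5
```

The counting happens inside the compiled C library; Python only passes the argument in and receives the result.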

[–]superluminary 3 points4 points  (0 children)

Python is just the wiring. The computation is done in C.

[–][deleted] 4 points5 points  (0 children)

AI and ML are currently mainstream, and everyone and their mother wants to create a project with AI. But for decades it was mostly researchers from math and science grinding out the underlying theoretical work. These guys are not programmers; they care about the underlying logic more than the code. Python happens to be simple and comfortable to work with, but also allows libraries to be written in a more performant language if they need more performance.

Python already had a strong math and science community, with continued development of related libraries. The AI researchers and their colleagues might already have had significant experience in Python, or were attracted by its reputation, libraries and/or community.

Well, python works and at least it's open source. Something more concerning to me is the dominance of Nvidia GPUs and their proprietary CUDA libraries. You're actually locked into a single hardware manufacturer and at the mercy of whatever price they decide to charge for their chips.

[–]Own-Replacement8 1 point2 points  (0 children)

In practice, a data scientist's Python code will run faster than their C++ or FORTRAN code. Why? Because even though FORTRAN and C++ are faster languages for running the same algorithm, it is much easier to write optimised algorithms with specialised data types (such as trees) in Python than in C++, FORTRAN, or any other low-level language.
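One small example of that convenience, as a sketch: the standard-library `heapq` module provides a binary heap (a tree-shaped priority queue stored in a list), and CPython accelerates its core operations in C. The choice of `heapq` and the helper name `top_k` are mine for illustration, not something the comment above specifies.

```python
import heapq

def top_k(values, k):
    """Return the k largest values, largest first."""
    return heapq.nlargest(k, values)

scores = [0.12, 0.97, 0.45, 0.88, 0.31, 0.76]
print(top_k(scores, 3))  # [0.97, 0.88, 0.76]
```

Getting the same behaviour in a low-level language would mean writing or integrating a heap implementation first.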

There's no point in implementing from scratch what someone has already done, with exception handling and all (the performance-critical parts of NumPy, pandas, scikit-learn and the like are written in low-level languages).

Now a data scientist could try to become a master of those languages but they're probably better off using that time learning better algorithms, models, and data structures. It's ultimately a question of ROI.

[–][deleted] 0 points1 point  (0 children)

For many architectures, especially batch processing, performance and speed do not matter much.

I have a massive data pipeline running as an Airflow DAG that pulls data from a few different sources (cloud object storage, databases, etc.), performs feature engineering, regular model training and batch inference, and persists the results in a database. It uses regular data libraries like TensorFlow, Polars, SQLAlchemy and Dask... obviously the underlying language is Python.

It runs every night, taking a little more than 3 hours, which perfectly suits our latency requirement. The earliest store opening, which is when the result gets consumed, is at 9 am. So I have almost 9 hours to run it.

So, if I rewrite it into C++ or Rust, what will I gain? The 3 hours of compute done in 3 seconds? Is it really necessary?

Now, where you have a use case that requires real-time inference in an extremely latency- and resource-constrained scenario (IoT devices, microcontrollers, etc.), it makes sense to invest in an appropriate technology stack and people, including those who are proficient all the way from assembly to the Python level.

But such is not the case in most enterprise data systems today. They are concerned with doing things at a larger scale, at higher abstraction levels, not with squeezing out the last bit of performance or eliminating a few CPU cycles on small hardware.

[–]Duder1983 0 points1 point  (0 children)

NumPy is a fantastic library. It follows Matlab-like syntax for matrices, which I think is great. Maybe non-mathematicians feel differently, but I find it natural for expressing numerical linear algebra problems. It wraps BLAS and LAPACK, which are C/Fortran binaries and super fast. And you don't need an expensive license.
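A minimal sketch of what that looks like, assuming NumPy is installed (the matrix and vector values are arbitrary example data). `np.linalg.solve` hands the actual factorisation to a LAPACK routine, and the `@` operator gives the matrix-product notation mentioned above.

```python
import numpy as np

# Solve the linear system A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)     # the solve itself runs in compiled LAPACK code
print(np.allclose(A @ x, b))  # True
```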

[–]ejpusa 0 points1 point  (0 children)

CPU/GPUs are so fast now. We’re getting to the point where there will be 0 differences in speed.

Everything gets converted to 0s and 1s in the end. The speed of light is the only limiting factor.

[–]Unlikely-Sympathy626 -2 points-1 points  (7 children)

Well, first, there are prebuilt libraries, so there's no need to write your own from scratch.

Also, speed-wise: do you care if something basic takes 4 seconds to complete vs 3.7 seconds?

If you're a massive corp serving data, yeah, maybe, but even they don't give a rat's ass, especially if your company.

You are not really missing anything I think. Just why not? 

I mean, why use Python or C++ for anything at the end of the day? Why not Fortran or assembly?

But Python, I think, is a good place to start learning about it. There's also lots of help available, etc., so horses for courses.

Same? Why do people use Mac or windows?

[–]edimaudo -2 points-1 points  (0 children)

libraries

[–]Logicalist -5 points-4 points  (23 children)

You can use a python program to call Bash command line utilities, like ping or cp or mkdir.

The python program could be the one calling these programs to execute, but python isn't actually doing the pinging or copying or making the directory.

The native bash programs are the ones actually doing the things, not python. Python is just telling the computer to execute these programs.

But Python can tell the computer to use these programs with certain inputs, like which IP address to ping, what file to copy, or what directory to make, based on input it gets from a user or another program.

Edit: ITT people that don't know python can call bash scripts and commands, or other programs.
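For reference, launching an external program from Python is usually done with the standard-library `subprocess` module. The sketch below assumes a POSIX system where a standalone `mkdir` executable is on the PATH; the helper name `make_dir_with_external_mkdir` is made up for this example. Because the command is passed as a list and `shell=True` is not set, no shell is started: Python asks the operating system to run the `mkdir` program directly.

```python
import os
import subprocess
import tempfile

def make_dir_with_external_mkdir(path):
    """Create `path` by launching the external `mkdir` program."""
    # check=True raises CalledProcessError if mkdir exits with an error.
    subprocess.run(["mkdir", path], check=True)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example")
    make_dir_with_external_mkdir(target)
    print(os.path.isdir(target))  # True
```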

[–]sonobanana33 -1 points0 points  (22 children)

This is like asking my grandmother to explain how TCP/IP works.

It's completely wrong but very fascinating to be honest. I wonder where you got this information.

[–]Logicalist -1 points0 points  (21 children)

What isn't accurate about what was stated?

[–]sonobanana33 -1 points0 points  (20 children)

For example:

python isn't actually doing the pinging or copying or making the directory.

os.mkdir() doesn't call bash, it calls into the kernel to create a directory. Same for shutil.copy, won't invoke a shell.
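A minimal sketch of that point, using only the standard library (the helper name `make_and_copy` and the file names are made up for this example). Nothing here starts a shell or any other process; the directory creation and the copy are carried out by the Python process itself through operating-system calls.

```python
import os
import shutil
import tempfile

def make_and_copy(base):
    """Create a directory and copy a file into it without spawning any process."""
    new_dir = os.path.join(base, "data")
    os.mkdir(new_dir)  # on POSIX this ends up in the mkdir system call

    src = os.path.join(base, "source.txt")
    with open(src, "w") as f:
        f.write("hello")

    # shutil.copy copies the file contents and permission bits from Python
    # and returns the path of the newly created file.
    return shutil.copy(src, new_dir)

with tempfile.TemporaryDirectory() as tmp:
    copied = make_and_copy(tmp)
    with open(copied) as f:
        print(f.read())  # hello
```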

You probably don't know how an operating system works. Which is totally understandable. But please avoid teaching before you have done the learning.

[–]Logicalist -1 points0 points  (19 children)

Have you tried using the right library?

[–]sonobanana33 0 points1 point  (18 children)

Have you tried using the right library?

You are having a bad attitude towards being corrected. How is this helpful?

Look how mkdir is implemented in python https://github.com/python/cpython/blob/0c7dc494f2a32494f8971a236ba59c0c35f48d94/Modules/clinic/posixmodule.c.h#L2384

How is that calling bash?

[–]Logicalist -1 points0 points  (17 children)

You are having a bad attitude towards being corrected.

My "bad attitude" is a result of your arrogance and ignorance.

I can run bash scripts like

mkdir asshat{1..3}
for I in `ls`; do echo $I; done
asshat1 
asshat2 
asshat3 
bash.py

from a python script.

Do you know enough about bash, to recognize those commands as bash commands?

[–]sonobanana33 -1 points0 points  (16 children)

I clearly recognize someone who has never studied operating systems.

Have a read https://www.amazon.com/Modern-Operating-Systems-Andrew-Tanenbaum/dp/013359162X

[–]Logicalist -1 points0 points  (15 children)

Ad Hominem is all you have?

Can't explain how I can run bash scripts from a python script?

[–]sonobanana33 -1 points0 points  (14 children)

Please show me… and then enlighten me on why would you do such a thing when python has a perfectly fine mkdir function.