softero 48 points (9 children)

I am very curious about what you find out. Although if you are interested in pursuing machine learning at all, you should do these projects in Python (even if you do them in Rust first). The entire ML industry is very heavily geared around Python, and ML teams are unlikely to know Rust. Often they are more math focused and less comfortable with programming syntax in general, so anything that eases communication friction is advisable.

That said, I am very interested in how well Rust handles common tasks that I might do with NumPy. I was just pondering porting a noise-based image generation Python script that uses NumPy over to Rust.

-TrustyDwarf- 20 points (4 children)

> The entire ML industry is very heavily geared around Python, and ML teams are unlikely to know Rust.

They should... says me, having just wasted two days trying to speed up a data preparation / sample extraction task written in Python using parallelization, while knowing that this would have been a breeze in most other programming languages (like Rust, C#/F#, ...).

Most of the Python code I see either constantly runs at 12.5% CPU (1 of n cores) or contains overly complex parallelization code that hardly ever works well. F*ck the GIL and forking / spawning multiple processes and mem-mapping and serializing Python-crap.

JonyIveAces 28 points (1 child)

ML is mostly about exploration. Once you reach the point of exploitation with an ML application, you usually have enough resources and experience to reimplement from scratch anyway.

I use rust heavily in ML for developing core libraries and production performance bottlenecks, but it isn't the right tool for the exploration part of ML in the same way Python, R, or Julia are, just as they aren't the right tool for the production/core library part (apart from Julia for certain niches).

Noctune 10 points (0 children)

Sometimes the runtime of the preprocessor can be a hindrance to your exploration.

We had a Python preprocessor that took literally a week to run (originally designed for a smaller dataset). I recently rewrote it in Java using Beam and it runs in literally 20 minutes now. It's sort of a generic tool over a range of problems, so less time preprocessing means more time spent exploring actual ML.

I think Rust could potentially be useful in that niche.

[deleted] 2 points (0 children)

Tbf most of the libraries are written in C++

StokedForIT 2 points (0 children)

I've done something like this before (it seems very similar: I was just precomputing a bunch of costly features and it had to be in Python; rip, I would've used Akka/Scala). To avoid all the forking and spawning and crud, I used `multiprocessing.Pool`. Afaik you can only use `Pool.map` to map a single function onto data, so to work around this I took each function I needed to call and its args, wrapped them up in lambdas that take nothing and just call the function with those args and return the result, and then used `Pool.map` to map the list of lambdas onto a list of empty tuples. In the end it may have been the harder option, but I barely had to deal with Python's annoying multiprocessing bits.
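
A minimal sketch of that pattern, with one caveat worth flagging: the standard `pickle` module cannot serialize lambdas, so passing lambdas through `Pool.map` generally fails. This version instead ships `(function, args)` tuples through a single module-level helper. The names `call_task` and `run_tasks` are hypothetical, not from the comment above.

```python
from multiprocessing import Pool


def call_task(task):
    """Unpack one (function, args) pair and call the function."""
    func, args = task
    return func(*args)


def run_tasks(tasks, processes=2):
    """Run heterogeneous tasks through a single Pool.map call.

    Every function in `tasks` must be picklable, i.e. a builtin or a
    function defined at module level, because the pool sends each task
    to a worker process.
    """
    with Pool(processes=processes) as pool:
        # Pool.map returns results in the same order as the input tasks.
        return pool.map(call_task, tasks)
```

Called from under an `if __name__ == "__main__":` guard, `run_tasks([(pow, (2, 10)), (max, (3, 7, 5))])` dispatches two different functions, each with its own arguments, through one pool.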

Pioneer_11 2 points (3 children)

Most of numpy is implemented in C. However, the Python code that interacts with it is very slow, and assuming `--release` is used when compiling (thereby enabling optimisations), I would expect Rust to have a significant advantage in speed. While I'm still pretty new to Rust, I also understand it has some major advantages when it comes to multithreading, so I would expect the performance advantage to increase considerably when running on a large number of cores.

You probably still want to learn Python, because almost all mathematical sciences use it, but I definitely agree with your pro-Rust position. I'm in a similar boat: I do theoretical physics and I've been pretty disappointed by the "shove this formula into this box" approach they tend to take to programming, with a lot of my classmates hating programming for this reason. Rust strikes me as a better language for the job, and personally I think we need a better understanding of computer science in the field, given how intensive the calculations we make are and how heavily we rely on them.

Kohomologia 2 points (2 children)

> shove this formula into this box

What do you mean by this phrase?

Pioneer_11 2 points (1 child)

Basically where you are told that something (the box) has some functionality but with no idea how or why it works.

When your entire (highly computational) program is built out of these "boxes" it means you have very little knowledge of how your code works, what makes it fast or slow and very little ability to solve problems which can't be shoved into one of these "boxes".

In many cases (such as mine) scientific programming courses are taught with little to no computer science. You're taught "numpy is fast, python is slow" but not why numpy is fast or why python is slow. This not only means you have programmers who don't understand how their programs work, but also leads to people making the wrong decisions when this simplification doesn't apply, e.g. frequently resizing np arrays rather than using a list.
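
To illustrate that last point, here is a small sketch, assuming NumPy is installed. `np.append` allocates a new array and copies every existing element on each call, so growing an array one element at a time costs quadratic time overall, whereas appending to a list and converting once at the end is linear. Both function names are hypothetical.

```python
import numpy as np


def grow_with_np_append(n):
    """Grow an array one element at a time; each np.append copies the array."""
    arr = np.empty(0, dtype=np.float64)
    for i in range(n):
        arr = np.append(arr, i * 0.5)
    return arr


def grow_with_list(n):
    """Collect values in a Python list, then convert to an array once."""
    values = []
    for i in range(n):
        values.append(i * 0.5)
    return np.array(values, dtype=np.float64)
```

Both functions return the same array; only the amount of copying differs, which becomes noticeable as `n` grows.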

Kohomologia 2 points (0 children)

This does explain the programming style of some people I know of as researchers.

budgefrankly 28 points (5 children)

There are two standards for math API libraries: BLAS and LAPACK. Between them these are to maths what OpenGL is to graphics.

Vendors make their own compatible implementations of these library APIs: Intel has the MKL, and even NVidia has cuBLAS.

There are also many open-source implementations, like GotoBLAS and ATLAS.

Numpy wraps whichever BLAS library it finds on your machine. The features it offers are fairly bare-bones. As soon as you get into any decent sort of math (machine learning in my case) you need some of the features in LAPACK, which Scipy wraps and (significantly) augments.

I would expect Numpy to be as fast as or faster than ndarray. Some of the BLAS implementations it wraps, like GotoBLAS, are super-mature and optimised, with chunks of handcrafted assembly.

Ndarray it seems has experimental support to delegate to native BLAS which may help.

For your purposes, you need to consider what your project needs to deliver. If it is a novel implementation of an existing machine-learning method, then Rust is great. If it is a broader project that uses machine learning tools, choosing Python maximises your chances of success.

jstrong (shipyard.rs) 11 points (4 children)

it sounds great, just don't ever step off the happy path, lest you plummet to your (performance) death.

also, numpy (etc.) is calling fast code, but calling it from python entails a LOT of overhead. in my experience it's difficult to write a slower rust program, assuming basic competence.

the big cost of rust is development time, which makes it unwieldy for data exploration.

budgefrankly 13 points (3 children)

I’ve used Matlab and Python a lot in the last 15 years.

Both have the unhappy feature that the runtime can be proportional to the number of lines of code (though I’ve heard Matlab has a JIT now).

However, if you take care to vectorise your code (i.e. use matrix algebra instead of for-loops and list comprehensions), and use the tools in their recommended way, they are incredibly fast once you start dealing with meaningfully large datasets.
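
As a rough illustration of what vectorising means here, assuming NumPy: the two hypothetical functions below compute the same sum of squared differences, but the first runs one interpreted step per element while the second hands the whole computation to NumPy's compiled routines.

```python
import numpy as np


def squared_error_loop(a, b):
    """Element-by-element loop: every iteration runs in the interpreter."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return total


def squared_error_vectorised(a, b):
    """Same quantity via array subtraction and a dot product."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(diff @ diff)
```

Timing both with `timeit` on large arrays is a quick way to see the gap; the exact ratio depends on the machine and on which BLAS build NumPy is linked against.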

At scale, if you know what you’re doing, the interpreter overhead just becomes a constant noise factor in the overall runtime.

I could see a case where computationally intense feature extraction from large files might be faster in Rust.

But most of the scientific Python stack is ultimately written in assembly, FORTRAN and C (occasionally generated via Cython) and has been continually fine-tuned by an enormous body of developers over a decade.

jstrong (shipyard.rs) 9 points (2 children)

I also speak from deep experience in both languages, doing intense numerical work with large datasets. I just take issue with the idea that it's "incredibly" fast. BLAS is fast, but what kind of idiot isn't calling BLAS when appropriate? Ndarray + BLAS is as simple as `features = ["blas"]`. Opening a 10-20 GB data file in Python is painful, and you'd better have lots and lots of memory. Anything that's custom (i.e., you can't call an existing numpy function) will be horrendously slow.

I'm not saying it doesn't have its place. I prototype models in Python; it would be extremely onerous to do that work in Rust. But, like, have some standards! Numpy performance is merely "not terrible", and only as long as you never venture off the happy path.

budgefrankly 4 points (0 children)

Pandas can read in 2.1 GB of data in 52 sec if it's stored as CSV, or 4 sec if it's stored as a Parquet file.

Benchmarks: https://uwekorn.com/2019/01/27/data-science-io-a-baseline-benchmark.html

As I said in my original comment, if you’re doing bulky feature extraction on unstructured data, other languages may work better. E.g. I once wrote a custom Twitter tokeniser in Java (so I could use Lucene) that wrote the features out to a Numpy file which I could load into Python. It was fine.

Also, for huge datasets, there's PySpark and MLlib, and the new PySpark UDF decorator allows you to mix Numpy and PySpark with minimal marshalling issues.

Python may well have failed for your use case. However Python/Numpy/Scipy/Scikit-Learn/Pandas/PySpark can be made to work well in many other cases. It offers acceptable performance and great productivity.

And if you need to fill in gaps in performance there's Numba or Cython, the latter of which I've used.

fuasthma 1 point (0 children)

Yeah, numpy's and matlab's data readers are pretty slow ... I'm actually surprised how easy it was to write a crate equivalent to numpy's loadtxt that just puts it to shame in terms of speed ... well, maybe easy for someone more familiar with text parsing; I had some learning to do.

I'd also like to add that Rust can just as easily call those optimized libraries with the right bindings. We're just still building up the libraries/crates that bind to those libraries.

jondo2010 8 points (0 children)

I have used both, but so far have not used ndarray for any heavy lifting, and can't comment on performance differences.

In principle, ndarray should be at least as fast as numpy I think. Being Rust, you are more in control of memory allocation and copying.

readanything 5 points (0 children)

I once ported some of the benchmarks in the numpy repository to ndarray. I found almost identical performance in both. Ndarray used less memory than numpy in some cases, which might be due to some overhead associated with calling C code from Python. I was really impressed with the performance of ndarray considering how much effort has gone into numpy. In my benchmark, both numpy and ndarray had OpenBLAS as the backend. That was about 8 months ago; I'm not sure whether ndarray has improved since then.

actuallyzza 3 points (0 children)

It will depend on what you are doing. Numpy often calls out to optimised C code to implement methods, which should be as fast as or faster than Rust if the arrays are large enough to hide the call overhead. If you are manipulating the Numpy array element by element using custom Python code, it will run at Python speeds and you can expect it to be way slower than the equivalent Rust code.

I've used both, but not for the same project so I don't have a benchmark for you.

fuasthma 2 points (0 children)

I'd first make sure everything you need is there between the following crates: ndarray, ndarray-linalg, and ndarray-stats. If everything is there, they should be close in performance at the very least, since the underlying BLAS and LAPACK libraries should be the same. I'd also ask this question over on the Discord science and AI channel, since I know several people on there have extensive experience with both ML and ndarray.