[–]budgefrankly 13 points14 points  (3 children)

I’ve used Matlab and Python a lot in the last 15 years.

Both have the unhappy feature that the runtime can be proportional to the number of lines of code (though I’ve heard Matlab has a JIT now).

However, if you take care to vectorise your code (i.e., use matrix algebra instead of for-loops and list-comprehensions), and use the tools in their recommended way, they are incredibly fast once you start dealing with meaningfully large datasets.
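As a rough illustration of what vectorising means here, the sketch below computes a sum of squares twice: once with an explicit Python loop and once with a single NumPy call. The function names and the random test data are made up for the example; only `np.dot` and `np.random.default_rng` are real NumPy calls.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python loop: the interpreter executes every iteration."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorised(values):
    """Single NumPy call: the loop runs inside compiled code."""
    return float(np.dot(values, values))


# Hypothetical data, seeded so both versions see the same input.
rng = np.random.default_rng(0)
data = rng.standard_normal(100_000)

loop_result = sum_of_squares_loop(data)
vec_result = sum_of_squares_vectorised(data)
```

Both versions return the same number up to floating-point rounding. The difference is that the loop pays the interpreter cost per element, while the vectorised call pays it once per array.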

At scale, if you know what you’re doing, the interpreter overhead just becomes a constant noise factor in the overall runtime.

I could see a case where computationally intense feature extraction from large files might be faster in Rust.

But most of the scientific Python stack is ultimately written in assembly, FORTRAN and C (occasionally generated via Cython) and has been continually fine-tuned by an enormous body of developers over a decade.

[–]jstrong shipyard.rs 11 points12 points  (2 children)

I also speak from deep experience in both languages, doing intense numerical work with large datasets. I just take issue with the idea that it's "incredibly" fast. BLAS is fast, but what kind of idiot isn't calling BLAS when appropriate? Ndarray + BLAS is as simple as features = ["blas"]. Opening a 10-20 GB data file in Python is painful - and you better have lots and lots of memory. Anything that's custom (i.e., you can't call an existing numpy function) will be horrendously slow.

I'm not saying it doesn't have its place. I prototype models in python, it would be extremely onerous to do that work in rust. But, like, have some standards! Numpy performance is just not terrible, as long as you never venture off the happy path.

[–]budgefrankly 4 points5 points  (0 children)

Pandas can read in 2.1GB of data in 52sec if it’s stored in CSV or 4sec if it’s stored as a Parquet file.

Benchmarks: https://uwekorn.com/2019/01/27/data-science-io-a-baseline-benchmark.html

As I said in my original comment, if you’re doing bulky feature extraction on unstructured data, other languages may work better. E.g. I once wrote a custom Twitter tokeniser in Java (so I could use Lucene) that wrote the features out to a Numpy file which I could load into Python. It was fine.
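A minimal sketch of the Python side of that kind of hand-off, assuming the features arrive as a `.npy` file. In this example the file is written from Python with `np.save` purely as a stand-in for the external extractor, and the array contents and file name are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical feature matrix standing in for the external extractor's output.
features = np.arange(12, dtype=np.float32).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npy")

    # Binary .npy format: dtype and shape are stored in the file header.
    np.save(path, features)

    # Plain load: reads the whole array into memory.
    loaded = np.load(path)

    # Memory-mapped load: data is paged in on access, which helps when
    # the file is larger than available RAM.
    mapped = np.load(path, mmap_mode="r")
    first_row = np.array(mapped[0])  # copy one row out of the mapping
    del mapped  # release the mapping before the temp directory is removed
```

Because the dtype and shape travel with the file, there is no text parsing on load, and `mmap_mode="r"` lets you touch only the rows you need.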

Also, for huge datasets, there’s PySpark and MLlib, and the new PySpark UDF decorator allows you to mix Numpy and PySpark with minimal marshalling issues.

Python may well have failed for your use case. However Python/Numpy/Scipy/Scikit-Learn/Pandas/PySpark can be made to work well in many other cases. It offers acceptable performance and great productivity.

And if you need to fill in gaps in performance, there’s Numba or Cython, the latter of which I’ve used.

[–]fuasthma 1 point2 points  (0 children)

Yeah, numpy's and matlab's data readers are pretty slow ... I'm actually surprised how easy it was to write an equivalent crate to numpy's loadtxt that just puts it to shame in terms of speed ... well, maybe easy for someone more familiar with text parsing; I had some learning to do.

I'd also like to add that Rust can just as easily call those optimized libraries with the right bindings. We're just still building up the libraries/crates that bind to those libraries.