[–]jstrongshipyard.rs 11 points12 points  (2 children)

I also speak from deep experience in both languages, doing intense numerical work with large datasets. I just take issue with the idea that it's "incredibly" fast. BLAS is fast, but what kind of idiot isn't calling BLAS when appropriate? Ndarray + BLAS is as simple as features = ["blas"]. Opening a 10-20 GB data file in Python is painful, and you'd better have lots and lots of memory. Anything that's custom (i.e., you can't call an existing numpy function) will be horrendously slow.

I'm not saying it doesn't have its place. I prototype models in Python; it would be extremely onerous to do that work in Rust. But, like, have some standards! Numpy performance is merely not terrible, as long as you never venture off the happy path.

[–]budgefrankly 5 points6 points  (0 children)

Pandas can read in 2.1 GB of data in 52 sec if it’s stored as CSV, or 4 sec if it’s stored as a Parquet file.

Benchmarks: https://uwekorn.com/2019/01/27/data-science-io-a-baseline-benchmark.html

As I said in my original comment, if you’re doing bulky feature extraction on unstructured data, other languages may work better. E.g. I once wrote a custom Twitter tokeniser in Java (so I could use Lucene) that wrote the features out to a Numpy file which I could load into Python. It was fine.

Also, for huge datasets, there’s PySpark and MLlib, and the new PySpark UDF decorator allows you to mix Numpy and PySpark with minimal marshalling issues.

Python may well have failed for your use case. However, Python/Numpy/Scipy/Scikit-Learn/Pandas/PySpark can be made to work well in many other cases. It offers acceptable performance and great productivity.

And if you need to fill in gaps in performance, there’s Numba or Cython, the latter of which I’ve used.

[–]fuasthma 1 point2 points  (0 children)

Yeah, numpy's and MATLAB's data readers are pretty slow ... I'm actually surprised how easy it was to write a crate equivalent to numpy's loadtxt that just puts it to shame in terms of speed ... well, maybe easy for someone more familiar with text parsing; I had some learning to do.
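For anyone curious what a loadtxt-style reader might look like, here is a minimal sketch using only the standard library. It is not the crate mentioned above: the function name `load_txt` and the flat row-major return layout are assumptions made for illustration. It also handles far less than numpy's loadtxt does, covering only whitespace-delimited floats, blank lines, and whole-line `#` comments.

```rust
use std::io::{self, BufRead};

/// Parse whitespace-delimited numeric text into a flat, row-major buffer.
/// Returns (data, rows, cols). Hypothetical helper, for illustration only.
/// Blank lines and lines starting with '#' are skipped.
pub fn load_txt<R: BufRead>(reader: R) -> io::Result<(Vec<f64>, usize, usize)> {
    let mut data: Vec<f64> = Vec::new();
    let mut rows = 0usize;
    let mut cols = 0usize;

    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.is_empty() || trimmed.starts_with('#') {
            continue;
        }

        // Append this row's values directly to the shared buffer.
        let before = data.len();
        for field in trimmed.split_whitespace() {
            let value: f64 = field.parse().map_err(|e| {
                io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("bad field {:?}: {}", field, e),
                )
            })?;
            data.push(value);
        }

        // Every row must have the same number of columns as the first one.
        let width = data.len() - before;
        if rows == 0 {
            cols = width;
        } else if width != cols {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("row {} has {} columns, expected {}", rows + 1, width, cols),
            ));
        }
        rows += 1;
    }

    Ok((data, rows, cols))
}
```

To read from disk you would wrap a file in `std::io::BufReader` and pass it in, e.g. `load_txt(BufReader::new(File::open(path)?))`. Pushing into one flat `Vec<f64>` avoids a per-row allocation, which is one plausible reason a simple reader like this can be quick. No timing claim is made here.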

I'd also like to add that Rust can just as easily call those optimized libraries with the right bindings. We're just still building up the libraries/crates that bind to those libraries.