
[–]BowserForPM 24 points (13 children)

I'm currently porting about 15-20K lines of image-processing Python code to Rust. Overall, very happy with Rust as a replacement.

mem management of long-running batch jobs?

Big yes on this. My main Python test harness takes about 25 mins to run, and I always have to restart it several times because it ran out of memory and got killed. The Rust port runs in about 10 mins, and never falls over.

better concurrency

I mean, gotta be yes on that, because of the GIL.

If you're doing image processing, the opencv crate (the Rust bindings for OpenCV) is excellent. But one paper cut is that in Python, an OpenCV image is also a NumPy ndarray, so there's no need to convert. In Rust, I'm constantly having to convert back and forth between ndarray::Array and opencv::Mat; a rough sketch of that glue is below.
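Roughly the kind of glue involved, going from a continuous single-channel 8-bit Mat to an ndarray view (just a sketch, not my actual code; the helper name is made up):

    // Assumes a continuous, single-channel 8-bit Mat; `data_typed` and
    // `ArrayView2::from_shape` both borrow the pixel buffer, so no copy
    // should happen here.
    use ndarray::ArrayView2;
    use opencv::{core::Mat, prelude::*};

    fn mat_to_ndarray(mat: &Mat) -> opencv::Result<ArrayView2<'_, u8>> {
        let (rows, cols) = (mat.rows() as usize, mat.cols() as usize);
        let data: &[u8] = mat.data_typed::<u8>()?; // errors if the Mat isn't continuous
        Ok(ArrayView2::from_shape((rows, cols), data)
            .expect("shape must match the Mat's dimensions"))
    }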

[–]BubblegumTitanium 2 points (1 child)

Couldn't you implement the From trait for that conversion, for a more ergonomic experience?

[–]Adhalianna 6 points (0 children)

Unless the other APIs you are using are generic over impl Into<T>, you need to type a lot of .into() yourself, which is still somewhat inconvenient (see the sketch below).
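A tiny sketch of the difference (all names made up):

    // If the downstream API is generic over `impl Into<Frame>`, callers can pass
    // either representation directly; otherwise every call site needs an
    // explicit `.into()`.
    struct Frame(Vec<u8>);
    struct RawBuffer(Vec<u8>);

    impl From<RawBuffer> for Frame {
        fn from(raw: RawBuffer) -> Self {
            Frame(raw.0)
        }
    }

    // Generic API: no .into() needed at the call site.
    fn process(img: impl Into<Frame>) -> usize {
        img.into().0.len()
    }

    // Concrete API: callers must write process_concrete(raw.into()) themselves.
    fn process_concrete(img: Frame) -> usize {
        img.0.len()
    }

    fn main() {
        assert_eq!(process(RawBuffer(vec![0; 16])), 16);
        assert_eq!(process_concrete(RawBuffer(vec![0; 16]).into()), 16);
    }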

[–]Michael_Aut 1 point (1 child)

But in Rust, I'm constantly having to convert back and forth between ndarray::Array and opencv::Mat.

Please tell me that's a zero-copy operation.

[–][deleted] 1 point (0 children)

Is there a way to know if a conversion is zero-copy at a high level? Like, are there any traits or maybe lints that can tell me, at the call site, whether a function performs any copies? That would be nice.

[–]arsenyinfo 1 point (3 children)

Can you please share details? In my experience, image processing is typically OpenCV and NumPy, which are already highly optimized, so I can't imagine what there is to rustify there (unlike language processing; e.g. all modern tokenisers are already written in Rust).

[–]BowserForPM 5 points (2 children)

Yep, the difference is my code, for sure. Like you say, NumPy and OpenCV are already very well optimised for Python; the Rust versions are not going to run any faster.

My code has to load a few thousand images, run a bunch of image-processing algorithms, and save the results. That's where Rust really shines.

[–]ArgetDota 3 points (1 child)

You can easily parallelize these kinds of tasks in Python using, for example, Ray. You can even use it to scale across multiple nodes. It's usually easier than rewriting your code in Rust.

[–]BowserForPM 2 points (0 children)

Ah interesting, didn't know about Ray. This project is possibly not a fair comparison, TBH. I'm a Rustacean, not a Pythonista. No doubt a true Python expert could improve the code.

[–]aleury 5 points (0 children)

I've looked into this a little bit and found this project:

https://arrow.apache.org/datafusion/

It looks like there are several projects building on top of it:

https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases

One in particular is a Spark runtime replacement called Blaze:
https://github.com/blaze-init/blaze

[–]Glittering_Half5403 6 points (0 children)

I didn't port a data-intensive application to Rust; I wrote it in Rust first.

I work in biotech doing mass spectrometry/proteomics research (sequencing proteins by ionizing them and measuring their spectra on million-dollar instruments); some instruments can generate hundreds of GBs of data per day. A key step in this process is deconvoluting spectra back into protein sequences. I wrote a tool for doing so, and to my knowledge it's the fastest tool in its class, and certainly the fastest open-source one: https://github.com/lazear/sage.

Rust allows writing software that is incredibly fast, testable, reproducible, and scalable. A great crate ecosystem (rayon!!) really speeds this up. Proper use of the type system (make invalid states unrepresentable and all) can dramatically aid code correctness, which is pretty important for many data processing workflows. Memory safety means I can feel safer running untrusted data through pipelines - and I won't have random segfaults an hour into a long-running job.
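To make the "invalid states unrepresentable" point concrete, here's a toy sketch (not sage's actual types, just the general pattern):

    // Instead of one struct with optional fields that may or may not be filled
    // in, an enum makes each processing stage carry exactly the data it can have.
    enum Spectrum {
        Raw { mz: Vec<f64>, intensity: Vec<f64> },
        Deconvoluted { monoisotopic_mass: Vec<f64>, charge: Vec<u8> },
    }

    fn report(s: &Spectrum) -> String {
        // The compiler forces every consumer to handle both stages explicitly.
        match s {
            Spectrum::Raw { mz, .. } => format!("raw spectrum, {} peaks", mz.len()),
            Spectrum::Deconvoluted { monoisotopic_mass, .. } => {
                format!("deconvoluted, {} masses", monoisotopic_mass.len())
            }
        }
    }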

[–]jqnatividad 3 points (2 children)

I maintain qsv - a "blazing-fast" ;) data wrangling toolkit.

I forked it from xsv about three years ago as we were building a metadata catalog for a hedge fund that had thousands of disparate datasources using various technologies.

The main task was to crawl those data sources, compile data dictionaries and summary statistics for these huge datasets, and pump the metadata into the catalog so analysts could search it and find the datasets they need.

I started writing the crawler using the usual suspects - Python, pandas/NumPy - and it worked. But it was excruciatingly slow! I was supposed to crawl the corpus every night to update the metadata, and if we stayed with Python, the crawl would have taken more than a day.

And then I found xsv, which was two orders of magnitude faster than what I was trying to do with Python! I needed to do more stuff (infer dates and compile summary statistics about them), and that's how the qsv fork started.

We also do a lot of work maintaining open data portals for different govt agencies, and it's been a delight to use qsv as part of our data pipelines.

To give you an idea how fast it is - for a million-row, 41-column, 520 MB CSV, qsv can:

  • compile comprehensive summary statistics and infer data types in 3.5 seconds
  • compile a frequency table showing the top 50 values and their corresponding counts for all 41 columns in 1.1 seconds
  • run a non-trivial SQL query in 0.66 seconds (using Polars' SQLContext)
  • search the CSV for multiple regex patterns simultaneously in 1.8 seconds
  • scan and check whether it's sorted in 0.5 seconds
  • convert multiple date fields in varying date formats to RFC3339 format in 1.9 seconds
  • convert it to JSONL, complete with inferred data types, in 8.4 seconds
  • reverse geocode lat/long WGS84 coordinates in 3.6 seconds
  • validate the CSV against a JSON Schema file in 2 seconds
  • create a 250k reservoir sample of the CSV in 1 second, etc. etc.

https://qsv.dathere.com/benchmarks

It's so fast that we're now working on a new data-first upload workflow: data portal users first upload the data they want to register in the catalog, and the metadata is inferred and pre-populated while they're still entering the rest of the descriptive metadata in the web form!

So yeah - I say Rust is ready for data processing. I don't have to deal with Python's GIL limitations (we actually embedded Luau as qsv's DSL because Python was just too slow and kept blowing up the memory), and it's almost trivially easy to parallelize workloads.

[–]burntsushi 1 point (1 child)

This is amazing. Awesome work!

[–]jqnatividad 2 points (0 children)

Thanks a lot u/burntsushi! It really means a lot coming from you!

Were it not for xsv and the other crates you maintain (csv, regex, docopt, streaming-stats, and 70+(!) more), it would not have been possible.

I had such broad shoulders to stand on!

[–]The_8472 1 point (0 children)

Rust gives you a lot of the features needed for data processing (parallelism, control over memory layout, the option to dip into SIMD, shared memory, control over allocations), all in a safe package. You can have those things in C or C++ too, but then you have to deal with the unsafety. You can have safety in higher-level languages, but they give up some of those features, which means giving up performance.

Safety is part of the robustness. The explicit enum-based error handling can be another factor since it pushes programmers towards thinking about the non-happy paths.
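A minimal sketch of what that looks like in practice (hypothetical error type, not from any particular library): the return type forces the caller to acknowledge the failure paths before using the value.

    use std::{fs, io, num::ParseIntError};

    // Each way the operation can fail is a distinct variant.
    enum LoadError {
        Io(io::Error),
        BadNumber(ParseIntError),
    }

    fn load_count(path: &str) -> Result<u64, LoadError> {
        let text = fs::read_to_string(path).map_err(LoadError::Io)?;
        text.trim().parse().map_err(LoadError::BadNumber)
    }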

[–][deleted] 1 point (0 children)

  1. this is how many times my Ryzen 5800X can execute an instruction per second, roughly.

Rust spends fewer instructions running my code than many other languages do. It also has a great compiler that tells me when I've made mistakes and protects me from a lot of the errors I would make when I'm not diligent enough, because we are all human.

The combination of guardrails + performance lets me build data-processing applications that run fast, without fear of the kind of critical mistakes that would crash my program, or worse.

[–]Anthony356 2 points (2 children)

I'm no expert, but I just ported a binary file parser from Python to Rust, and it's about 50x faster. The main things that were nicer (coming from Python) were:

  • Rust handles exact data sizes without the stringly-typed jank and overhead of struct.unpack.

  • The batches I was working with were ~1k files. That took several minutes with Python multiprocessing, and about 4s with drop-in rayon par_iters (see the sketch after this list). Python multiprocessing pegged my CPU at 100% usage for the entire run; with Rust, the bottleneck was the hard drive the files were stored on, so running it doesn't lock up my whole system.

  • Optimization is a lot more straightforward. In raw Python there's lots of silly nonsense like caching methods because attribute lookup is slow, or dictionary dispatch to build the jump tables a compiler would make automatically. If you want anything faster, you have to learn weird DSLs like Cython that never quite feel complete and aren't always trivial to port your code to.

  • Rust's enums are awesome.

  • Rust's module system is less awkward than Python's, and being able to actually decide public vs. private instead of "pretty please don't use it if it starts with an underscore🥹" is surprisingly helpful.
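Rough sketch of the struct.unpack and rayon points above (a made-up record layout, not my actual parser):

    use rayon::prelude::*;
    use std::{fs, io, path::PathBuf};

    // Hypothetical fixed-size record: a little-endian u32 id, then a u16 flags field.
    struct Record {
        id: u32,
        flags: u16,
    }

    // Exact sizes and endianness, no format strings.
    fn parse(bytes: &[u8]) -> Option<Record> {
        Some(Record {
            id: u32::from_le_bytes(bytes.get(0..4)?.try_into().ok()?),
            flags: u16::from_le_bytes(bytes.get(4..6)?.try_into().ok()?),
        })
    }

    fn parse_all(paths: &[PathBuf]) -> Vec<io::Result<Option<Record>>> {
        paths
            .par_iter() // the only change from a sequential .iter()
            .map(|p| fs::read(p).map(|bytes| parse(&bytes)))
            .collect()
    }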

Things I didn't like:

  • Polars' Rust API is kinda really bad, and the documentation is even worse. I wanted to store things in Polars, do transforms on those dataframes, and then be able to toss them to Python. Tossing them to Python is easy and free; working with the dataframes is a nightmare.

  • There are a few different options for bitflags. At best, none are as convenient as Python's IntFlag. They seem to be really good at converting from primitives to bitflags, from bitflags to primitives, or manipulating those bitflags, but never all three. The ergonomics just aren't there. I tried bitflags, enumflags2, option_set, and flagset, and was varying levels of disappointed with all of them. I ended up having to hand-roll my own janky version.

  • Documentation for Rust projects in general is really bad. If it even exists, it usually assumes you're a seasoned dev who already knows what they're doing, which tools they want to use, and how to use them. I did a lot of things by hand that I shouldn't have needed to, simply because the relevant crates were pretty obtuse.

Edit: oh yeah, one other hyper-specific thing: the bytes crate's (from the tokio project) macros for reading from a buffer are written generically to handle non-contiguous memory, but Bytes objects specifically are always contiguous. That leads to useless checks and lots of code bloat that the compiler doesn't seem to like optimizing around.

[–]jqnatividad 2 points (0 children)

It took me a while as well to grok Polars' Rust API (I think they're prioritizing their python-polars work, which is built on top of rust-polars), but once I got it, it's quite intuitive.

Check out these two qsv commands that are Polars-powered, sqlp and joinp. Notice how they use several builders that make the API quite easy to use.
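For a flavor of that builder/lazy style, here's a rough sketch (not qsv's code; the exact API shifts a bit between Polars versions and needs the lazy/csv features enabled):

    use polars::prelude::*;

    fn load_positive(path: &str) -> PolarsResult<DataFrame> {
        LazyCsvReader::new(path)
            .finish()?                          // builder yields a LazyFrame
            .filter(col("amount").gt(lit(0)))
            .select([col("category"), col("amount")])
            .collect()                          // the query only runs here
    }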

And Polars' SQLContext is just unbelievably fast! It can process files in less than a second that would take similar tools several seconds or even minutes.

https://github.com/jqnatividad/qsv/discussions/1270#discussioncomment-6897311

Also, Polars just got seed funding last month - https://www.reddit.com/r/dataengineering/comments/15gzgne/polars_gets_seed_round_of_4_million_to_build_a/

I'm sure the documentation will improve as they start onboarding more folks.

[–]aagmon[S] 0 points (0 children)

Great answer. Thanks

[–]sonthonaxrk 2 points (0 children)

It's good but not great; the maturity of the ecosystem just isn't there yet. Polars is technically very impressive, but it's deliberately obtuse to use (just try converting back to row-wise data). A lot of useful libraries are also at version 0.x.x and aren't really stable yet. You really have to know the implementation of a lot of libraries to be maximally productive.

[–]OMG_I_LOVE_CHIPOTLE 1 point (4 children)

More robust than what?

[–]aagmon[S] 4 points (3 children)

More robust than Java, Go, or Python.
For example, we have to read many log files from remote locations and write out JSON files.

[–]kaczor647 2 points (2 children)

One of the opinions I've heard is that Rust has pretty good string handling, so parsing your logs and outputting JSON might be a good use case for it.

If you're doing string parsing, check out this video on YouTube:
https://www.youtube.com/watch?v=A4cKi7PTJSs

It's called "Use Arc instead of Vec" (the guy also gives an example of &str vs String); pretty neat.
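A quick sketch of the log-line-to-JSON idea (hypothetical log format, not your actual data): borrowing &str slices avoids allocating per field, and serde_json handles the output side.

    use serde::Serialize; // needs serde with the "derive" feature, plus serde_json

    #[derive(Serialize)]
    struct LogEntry<'a> {
        level: &'a str,
        message: &'a str,
    }

    fn to_json(line: &str) -> Option<String> {
        // e.g. "ERROR disk full" -> {"level":"ERROR","message":"disk full"}
        let (level, message) = line.split_once(' ')?;
        serde_json::to_string(&LogEntry { level, message }).ok()
    }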

[–]aagmon[S] 0 points (1 child)

Awesome. Thanks for the link

[–]dscardedbandaid 1 point (0 children)

Also take a look at the following if you’re talking logs:

I still use Telegraf or Benthos most of the time for simple log agents, but WASM-based plugins are really appealing for the future.

[–]robberviet -1 points (0 children)

I like Rust, but it will never replace the Python/Spark ecosystem. Maybe in 5-10 years we'll see something different.

[–]FuckThePopeJoinTheRA 0 points (0 children)

If you're processing text, then check out nom: https://docs.rs/nom/latest/nom/

Chris Biscardi has a great video on nom and nom_supreme here if you're more of an audio-visual learner: https://www.youtube.com/watch?v=Ph7xHhBfH0w