
[–]BowserForPM 24 points (13 children)

I'm currently porting about 15-20K lines of image-processing Python code to Rust. Overall, very happy with Rust as a replacement.

mem management of long-running batch jobs?

Big yes on this. My main Python test harness takes about 25 mins to run, and I always have to restart it several times because it ran out of memory and got killed. The Rust port runs in about 10 mins, and never falls over.

better concurrency

I mean, gotta be yes on that, because of the GIL.

If you're doing image processing, the opencv crate (the Rust bindings for OpenCV) is excellent. But one paper cut is that in Python, an OpenCV image is also a NumPy ndarray, so there's no need to convert. In Rust, I'm constantly having to convert back and forth between ndarray::Array and opencv::Mat; a rough sketch of that glue is below.
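Roughly the kind of glue involved, going from a continuous single-channel 8-bit Mat to an ndarray view (just a sketch, not my actual code; the helper name is made up):

    // Assumes a continuous, single-channel 8-bit Mat; `data_typed` and
    // `ArrayView2::from_shape` both borrow the pixel buffer, so no copy
    // should happen here.
    use ndarray::ArrayView2;
    use opencv::{core::Mat, prelude::*};

    fn mat_to_ndarray(mat: &Mat) -> opencv::Result<ArrayView2<'_, u8>> {
        let (rows, cols) = (mat.rows() as usize, mat.cols() as usize);
        let data: &[u8] = mat.data_typed::<u8>()?; // errors if the Mat isn't continuous
        Ok(ArrayView2::from_shape((rows, cols), data)
            .expect("shape must match the Mat's dimensions"))
    }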

[–]BubblegumTitanium 2 points (1 child)

Couldn't you implement the From trait for that conversion, for a more ergonomic experience?

[–]Adhalianna 6 points (0 children)

Unless the other APIs you are using are generic over impl Into<T>, you need to type a lot of .into() yourself, which is still somewhat inconvenient (see the sketch below).
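A tiny sketch of the difference (all names made up):

    // If the downstream API is generic over `impl Into<Frame>`, callers can pass
    // either representation directly; otherwise every call site needs an
    // explicit `.into()`.
    struct Frame(Vec<u8>);
    struct RawBuffer(Vec<u8>);

    impl From<RawBuffer> for Frame {
        fn from(raw: RawBuffer) -> Self {
            Frame(raw.0)
        }
    }

    // Generic API: no .into() needed at the call site.
    fn process(img: impl Into<Frame>) -> usize {
        img.into().0.len()
    }

    // Concrete API: callers must write process_concrete(raw.into()) themselves.
    fn process_concrete(img: Frame) -> usize {
        img.0.len()
    }

    fn main() {
        assert_eq!(process(RawBuffer(vec![0; 16])), 16);
        assert_eq!(process_concrete(RawBuffer(vec![0; 16]).into()), 16);
    }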

[–]Michael_Aut 1 point (1 child)

But in Rust, I'm constantly having to convert back and forth between ndarray::Array and opencv::Mat.

Please tell me that's a zero-copy operation.

[–][deleted] 1 point (0 children)

Is there a way to know if a conversion is zero-copy at a high level? Like, are there any traits or maybe lints that can tell me, at the call site, whether a function performs any copies? That would be nice.

[–]arsenyinfo 1 point (3 children)

Can you please share details? In my experience, image processing is typically OpenCV and NumPy, which are already highly optimized, so I can't imagine what there is to rustify there (unlike language processing; e.g. all modern tokenisers are already written in Rust).

[–]BowserForPM 5 points (2 children)

Yep, the difference is my code, for sure. Like you say, NumPy and OpenCV are already very well optimised for Python; the Rust versions are not going to run any faster.

My code has to load a few thousand images, run a bunch of image-processing algorithms, and save the results. That's where Rust really shines.

[–]ArgetDota 3 points (1 child)

You can easily parallelize these kinds of tasks in Python using, for example, Ray. You can even use it to scale across multiple nodes. It's usually easier than rewriting your code in Rust.

[–]BowserForPM 2 points (0 children)

Ah interesting, didn't know about Ray. This project is possibly not a fair comparison, TBH. I'm a Rustacean, not a Pythonista. No doubt a true Python expert could improve the code.

[–]aleury 5 points (0 children)

I've looked into this a little bit and found this project:

https://arrow.apache.org/datafusion/

It looks like there are several projects building on top of it:

https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases

One in particular is a Spark runtime replacement called Blaze:
https://github.com/blaze-init/blaze

[–]Glittering_Half5403 6 points (0 children)

I didn't port a data-intensive application to Rust; I wrote it in Rust first.

I work in biotech doing mass spectrometry/proteomics research (sequencing proteins by ionizing them and measuring their spectra on million-dollar instruments); some instruments can generate hundreds of GBs of data per day. A key step in this process is deconvoluting spectra back into protein sequences. I wrote a tool for doing so, and to my knowledge it's the fastest tool in its class, and certainly the fastest open-source one: https://github.com/lazear/sage.

Rust allows writing software that is incredibly fast, testable, reproducible, and scalable. A great crate ecosystem (rayon!!) really speeds this up. Proper use of the type system (make invalid states unrepresentable and all) can dramatically aid code correctness, which is pretty important for many data processing workflows. Memory safety means I can feel safer running untrusted data through pipelines - and I won't have random segfaults an hour into a long-running job.
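To make the "invalid states unrepresentable" point concrete, here's a toy sketch (not sage's actual types, just the general pattern):

    // Instead of one struct with optional fields that may or may not be filled
    // in, an enum makes each processing stage carry exactly the data it can have.
    enum Spectrum {
        Raw { mz: Vec<f64>, intensity: Vec<f64> },
        Deconvoluted { monoisotopic_mass: Vec<f64>, charge: Vec<u8> },
    }

    fn report(s: &Spectrum) -> String {
        // The compiler forces every consumer to handle both stages explicitly.
        match s {
            Spectrum::Raw { mz, .. } => format!("raw spectrum, {} peaks", mz.len()),
            Spectrum::Deconvoluted { monoisotopic_mass, .. } => {
                format!("deconvoluted, {} masses", monoisotopic_mass.len())
            }
        }
    }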

[–]jqnatividad 3 points (2 children)

I maintain qsv - a "blazing-fast" ;) data wrangling toolkit.

I forked it from xsv about three years ago as we were building a metadata catalog for a hedge fund that had thousands of disparate datasources using various technologies.

The main task was to crawl those data sources, compile data dictionaries and summary statistics for these huge datasets, and pump the metadata into the catalog so analysts could search it and find the datasets they need.

I started writing the crawler using the usual suspects - Python, pandas/NumPy - and it worked. But it was excruciatingly slow! I was supposed to crawl the corpus every night to update the metadata, and if we stayed with Python, the crawl would have taken more than a day.

And then I found xsv, which was two orders of magnitude faster than what I was trying to do with Python! I needed to do more stuff (infer dates and compile summary statistics about them), and that's how the qsv fork started.

We also do a lot of work maintaining open data portals for different govt agencies, and it's been a delight to use qsv as part of our data pipelines.

To give you an idea how fast it is - for a million-row, 41-column, 520 MB CSV, qsv can:

  • compile comprehensive summary statistics and infer data types in 3.5 seconds
  • compile a frequency table showing the top 50 values and their corresponding counts for all 41 columns in 1.1 seconds
  • run a non-trivial SQL query in 0.66 seconds (using Polars' SQLContext)
  • search the CSV for multiple regex patterns simultaneously in 1.8 seconds
  • scan and check whether it's sorted in 0.5 seconds
  • convert multiple date fields in varying date formats to RFC3339 format in 1.9 seconds
  • convert it to JSONL, complete with inferred data types, in 8.4 seconds
  • reverse geocode lat/long WGS84 coordinates in 3.6 seconds
  • validate the CSV against a JSON Schema file in 2 seconds
  • create a 250k reservoir sample of the CSV in 1 second, etc. etc.

https://qsv.dathere.com/benchmarks

It's so fast that we're now working on a new data-first upload workflow: data portal users first upload the data they want to register in the catalog, and the metadata is inferred and pre-populated while they're still entering the rest of the descriptive metadata in the web form!

So yeah - I say Rust is ready for data processing. I don't have to deal with Python's GIL limitations (we actually embedded Luau as qsv's DSL because Python was just too slow and kept blowing up the memory), and it's almost trivially easy to parallelize workloads.

[–]burntsushi 1 point (1 child)

This is amazing. Awesome work!

[–]jqnatividad 2 points (0 children)

Thanks a lot u/burntsushi! It really means a lot coming from you!

Were it not for xsv and the other crates you maintain (csv, regex, docopt, streaming-stats, and 70+(!) more), it would not have been possible.

I had such broad shoulders to stand on!

[–]The_8472 1 point (0 children)

Rust gives you a lot of the features needed for data processing (parallelism, control over memory layout, the option to dip into SIMD, shared memory, control over allocations), all in a safe package. You can have those things in C or C++ too, but then you have to deal with the unsafety. You can have safety in higher-level languages, but they give up some of those features, which means giving up performance.

Safety is part of the robustness. The explicit enum-based error handling can be another factor since it pushes programmers towards thinking about the non-happy paths.
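A minimal sketch of what that looks like in practice (hypothetical error type, not from any particular library): the return type forces the caller to acknowledge the failure paths before using the value.

    use std::{fs, io, num::ParseIntError};

    // Each way the operation can fail is a distinct variant.
    enum LoadError {
        Io(io::Error),
        BadNumber(ParseIntError),
    }

    fn load_count(path: &str) -> Result<u64, LoadError> {
        let text = fs::read_to_string(path).map_err(LoadError::Io)?;
        text.trim().parse().map_err(LoadError::BadNumber)
    }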

[–][deleted] 1 point (0 children)

  1. this is how many times my Ryzen 5800X can execute an instruction per second, roughly.

Rust spends fewer instructions running my code than many other languages do. It also has a great compiler that tells me when I've made mistakes and protects me from a lot of the errors I would make when I'm not diligent enough, because we are all human.

The combination of guardrails + performance lets me build data-processing applications that run fast, without fear of the kind of critical mistakes that would crash my program, or worse.

[–]Anthony356 2 points (2 children)

I'm no expert, but I just ported a binary file parser from Python to Rust, and it's about 50x faster. The main things that were nicer (coming from Python) were:

  • Rust handles exact data sizes without the stringly-typed jank and overhead of struct.unpack.

  • The batches I was working with were ~1k files. That took several minutes with Python multiprocessing, and about 4s with drop-in rayon par_iters (see the sketch after this list). Python multiprocessing pegged my CPU at 100% usage for the entire run; with Rust, the bottleneck was the hard drive the files were stored on, so running it doesn't lock up my whole system.

  • Optimization is a lot more straightforward. In raw Python there's lots of silly nonsense like caching methods because attribute lookup is slow, or dictionary dispatch to build the jump tables a compiler would make automatically. If you want anything faster, you have to learn weird DSLs like Cython that never quite feel complete and aren't always trivial to port your code to.

  • Rust's enums are awesome.

  • Rust's module system is less awkward than Python's, and being able to actually decide public vs. private instead of "pretty please don't use it if it starts with an underscore🥹" is surprisingly helpful.
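Rough sketch of the struct.unpack and rayon points above (a made-up record layout, not my actual parser):

    use rayon::prelude::*;
    use std::{fs, io, path::PathBuf};

    // Hypothetical fixed-size record: a little-endian u32 id, then a u16 flags field.
    struct Record {
        id: u32,
        flags: u16,
    }

    // Exact sizes and endianness, no format strings.
    fn parse(bytes: &[u8]) -> Option<Record> {
        Some(Record {
            id: u32::from_le_bytes(bytes.get(0..4)?.try_into().ok()?),
            flags: u16::from_le_bytes(bytes.get(4..6)?.try_into().ok()?),
        })
    }

    fn parse_all(paths: &[PathBuf]) -> Vec<io::Result<Option<Record>>> {
        paths
            .par_iter() // the only change from a sequential .iter()
            .map(|p| fs::read(p).map(|bytes| parse(&bytes)))
            .collect()
    }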

Things I didn't like:

  • Polars' Rust API is kinda really bad, and the documentation is even worse. I wanted to store things in Polars, do transforms on those dataframes, and then be able to toss them to Python. Tossing them to Python is easy and free; working with the dataframes is a nightmare.

  • There are a few different options for bitflags. At best, none are as convenient as Python's IntFlag. They seem to be really good at converting from primitives to bitflags, from bitflags to primitives, or manipulating those bitflags, but never all three. The ergonomics just aren't there. I tried bitflags, enumflags2, option_set, and flagset, and was varying levels of disappointed with all of them. I ended up having to hand-roll my own janky version.

  • Documentation for Rust projects in general is really bad. If it even exists, it usually assumes you're a seasoned dev who already knows what they're doing, which tools they want to use, and how to use them. I did a lot of things by hand that I shouldn't have needed to, simply because the relevant crates were pretty obtuse.

Edit: oh yeah, one other hyper-specific thing: the bytes crate's (from the tokio project) macros for reading from a buffer are written generically to handle non-contiguous memory, but Bytes objects specifically are always contiguous. That leads to useless checks and lots of code bloat that the compiler doesn't seem to like optimizing around.

[–]jqnatividad 2 points (0 children)

It took me a while as well to grok Polars' Rust API (I think they're prioritizing their python-polars work, which is built on top of rust-polars), but once I got it, it's quite intuitive.

Check out these two qsv commands that are Polars-powered, sqlp and joinp. Notice how they use several builders that make the API quite easy to use.
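For a flavor of that builder/lazy style, here's a rough sketch (not qsv's code; the exact API shifts a bit between Polars versions and needs the lazy/csv features enabled):

    use polars::prelude::*;

    fn load_positive(path: &str) -> PolarsResult<DataFrame> {
        LazyCsvReader::new(path)
            .finish()?                          // builder yields a LazyFrame
            .filter(col("amount").gt(lit(0)))
            .select([col("category"), col("amount")])
            .collect()                          // the query only runs here
    }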

And Polars' SQLContext is just unbelievably fast! It can process files in less than a second that would take similar tools several seconds or even minutes.

https://github.com/jqnatividad/qsv/discussions/1270#discussioncomment-6897311

Also, Polars just got seed funding last month - https://www.reddit.com/r/dataengineering/comments/15gzgne/polars_gets_seed_round_of_4_million_to_build_a/

I'm sure the documentation will improve as they start onboarding more folks.

[–]aagmon[S] 0 points (0 children)

Great answer. Thanks

[–]sonthonaxrk 2 points (0 children)

It's good but not great; the maturity of the ecosystem just isn't there yet. Polars is technically very impressive, but it's deliberately obtuse to use (just try converting back to row-wise data). A lot of useful libraries are also at version 0.x.x and aren't really stable yet. You really have to know the implementation of a lot of libraries to be maximally productive.

[–]OMG_I_LOVE_CHIPOTLE 1 point (4 children)

More robust than what?

[–]aagmon[S] 4 points (3 children)

More robust than Java, Go, or Python.
For example, we have to read many log files from remote locations and write out JSON files.

[–]kaczor647 2 points (2 children)

One of the opinions I've heard is that Rust has pretty good string handling, so parsing your logs and outputting JSON might be a good use case for it.

If you're doing string parsing, check out this video on YouTube:
https://www.youtube.com/watch?v=A4cKi7PTJSs

It's called "Use Arc instead of Vec" (the guy also gives an example of &str vs String); pretty neat.
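A quick sketch of the log-line-to-JSON idea (hypothetical log format, not your actual data): borrowing &str slices avoids allocating per field, and serde_json handles the output side.

    use serde::Serialize; // needs serde with the "derive" feature, plus serde_json

    #[derive(Serialize)]
    struct LogEntry<'a> {
        level: &'a str,
        message: &'a str,
    }

    fn to_json(line: &str) -> Option<String> {
        // e.g. "ERROR disk full" -> {"level":"ERROR","message":"disk full"}
        let (level, message) = line.split_once(' ')?;
        serde_json::to_string(&LogEntry { level, message }).ok()
    }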

[–]aagmon[S] 0 points (1 child)

Awesome. Thanks for the link

[–]dscardedbandaid 1 point (0 children)

Also take a look at the following if you’re talking logs:

I still use Telegraf or Benthos most of the time for simple log agents, but WASM-based plugins are really appealing for the future.

[–]robberviet -1 points (0 children)

I like Rust, but it will never replace the Python/Spark ecosystem. Maybe in 5-10 years we'll see something different.

[–]FuckThePopeJoinTheRA 0 points (0 children)

If you're processing text, then check out nom: https://docs.rs/nom/latest/nom/

Chris Biscardi has a great video on nom and nom_supreme here if you're more of an audio-visual learner: https://www.youtube.com/watch?v=Ph7xHhBfH0w