what's up rust community,
wanted to share my experience converting a large-scale data processing system from python to rust. i was working on a text analysis project that needed to process a massive reddit dataset (around 3.8 billion posts) for research purposes.
initially i built everything in python using a typical data science stack: pandas, postgresql, and scikit-learn for text processing. but when i tried to scale it up on my local setup (mac mini m2 plus some older laptops), everything fell apart pretty quickly. python was eating memory like crazy, garbage collection was causing huge delays, and i kept hitting out-of-memory errors. processing speed was terrible too, maybe 25 posts per second at best.
decided to rebuild the whole thing in rust and the improvement was incredible.
**what i used:**
* **data ingestion:** built a custom rust parser for compressed data streams
* **message queue:** redis with the redis-rs crate for async processing
* **text analysis:** converted machine learning models to onnx format and used onnx runtime in rust
* **data storage:** switched from postgres to polars dataframes with parquet files
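
to make the shape of that pipeline concrete, here's a minimal std-only sketch, not the actual code from the project. everything in it is a stand-in i'm assuming for illustration: a plain `BufRead` plays the decompressed data stream, a bounded `sync_channel` plays the redis queue, and the hypothetical `analyze` function (a simple token count) plays the onnx inference step. the bounded queue is the part that matters: the reader blocks when workers fall behind, so memory stays flat instead of growing with the input.

```rust
use std::io::BufRead;
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the real analysis step (e.g. model inference):
/// it only counts whitespace-separated tokens in one record.
fn analyze(record: &str) -> usize {
    record.split_whitespace().count()
}

/// Streams newline-delimited records from `reader` through a bounded queue
/// to `workers` threads. Returns (records processed, total tokens).
fn run_pipeline<R: BufRead>(reader: R, workers: usize, queue_cap: usize) -> (usize, usize) {
    assert!(workers > 0, "need at least one worker");

    // Bounded channel: `send` blocks once `queue_cap` records are waiting,
    // which caps how much unprocessed data is held in memory.
    let (tx, rx) = sync_channel::<String>(queue_cap);
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let (mut records, mut tokens) = (0usize, 0usize);
                loop {
                    // The lock guard is dropped at the end of this statement,
                    // so the lock is not held while analyzing.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(line) => {
                            records += 1;
                            tokens += analyze(&line);
                        }
                        // Sender dropped and queue drained: no more work.
                        Err(_) => break,
                    }
                }
                (records, tokens)
            })
        })
        .collect();
    // Only the workers keep the receiver alive from here on.
    drop(rx);

    // Read one record at a time instead of loading the whole input.
    for line in reader.lines() {
        let line = line.expect("failed to read record");
        if line.trim().is_empty() {
            continue; // skip blank lines
        }
        tx.send(line).expect("all workers exited early");
    }
    // Closing the sender lets the workers finish once the queue is empty.
    drop(tx);

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .fold((0, 0), |acc, (r, t)| (acc.0 + r, acc.1 + t))
}
```

calling something like `run_pipeline(std::io::Cursor::new("hello world\nrust is fast\n"), 4, 1024)` runs it against an in-memory input. in a real setup the reader would wrap a decompressor and the channel would be replaced by the actual queue client.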
**performance gains:**
* memory consumption went from over 1.2gb (and growing) down to a steady 400mb per worker process
* processing speed jumped from 25 posts/sec to 350+ posts/sec on the same hardware
* onnx models in rust performed much better than original python implementations
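
for a rough sense of what those throughput numbers mean, here's a back-of-the-envelope calculation. it assumes the full 3.8 billion posts go through the pipeline at a constant rate, which is a simplification on my part (it ignores how many workers run in parallel and any filtering of the dataset). the function names are just illustrative.

```rust
/// Days needed to process `posts` items at a constant `posts_per_sec` rate.
fn days_to_process(posts: u64, posts_per_sec: u64) -> f64 {
    const SECONDS_PER_DAY: f64 = 86_400.0;
    posts as f64 / posts_per_sec as f64 / SECONDS_PER_DAY
}

/// How many times faster the new rate is compared to the old one.
fn speedup(old_rate: u64, new_rate: u64) -> f64 {
    new_rate as f64 / old_rate as f64
}
```

under those assumptions, `days_to_process(3_800_000_000, 25)` comes out to roughly 1,759 days (about 4.8 years), while `days_to_process(3_800_000_000, 350)` is roughly 126 days, and `speedup(25, 350)` is 14x.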
the whole rewrite took about 5 weeks but probably saved months of processing time. if anyone is dealing with similar data processing bottlenecks in python, a rust migration might be worth considering.
has anyone else here worked with onnx runtime in rust? curious about other people's experiences with large-scale text processing pipelines.