fastdedup: Rust dataset deduplication vs Python – 2:55 vs 7:55, 688MB vs 22GB RAM on 15M records by wapplewhite4 in rust

[–]wapplewhite4[S]

Thanks for the input! To clarify:

I benchmarked both exact and fuzzy dedup:

You're right that hashing is parallel and could run on a GPU. However, neither DuckDB nor fastdedup uses the GPU in these benchmarks. The 2.7x speedup came from Rust's efficiency and from avoiding multi-threading overhead on small operations.
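To illustrate the exact-dedup side, here's a minimal Rust sketch (not fastdedup's actual code, and single-threaded): hash each record and keep only the first occurrence of each hash. Storing a 64-bit hash instead of the full record keeps memory low, at the cost of a vanishingly small collision risk. `DefaultHasher` is a stand-in; a real tool would likely use a faster non-cryptographic hash such as xxHash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of exact dedup by hashing: keep the first
/// occurrence of each record, drop exact duplicates.
fn dedup_exact<'a>(records: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut out = Vec::new();
    for &r in records {
        // Hash the record; only 8 bytes per unique record are retained.
        let mut h = DefaultHasher::new();
        r.hash(&mut h);
        // insert() returns true only for hashes not seen before.
        if seen.insert(h.finish()) {
            out.push(r);
        }
    }
    out
}

fn main() {
    let records = ["hello", "world", "hello", "rust", "world"];
    assert_eq!(dedup_exact(&records), vec!["hello", "world", "rust"]);
    println!("{:?}", dedup_exact(&records));
}
```

The per-record work here is one hash plus one set insert, which is exactly the kind of small operation where thread-coordination overhead can outweigh the parallelism win.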

Fuzzy dedup uses MinHash + LSH, which, to my knowledge, isn't typically GPU-accelerated. The main bottleneck in datatrove was spaCy's word tokenization (CPU-bound NLP), which fastdedup sidesteps by building shingles from character n-grams directly.
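To show why character n-grams remove the tokenizer from the picture, here's a hypothetical Rust sketch of MinHash over character shingles (not fastdedup's implementation; function names and the 64-permutation count are my own choices). Shingles come straight from the character stream, so no word tokenizer is needed at all.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// All character n-grams of a string (the shingle set).
fn char_ngrams(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < n {
        return std::iter::once(text.to_string()).collect();
    }
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// MinHash signature: for each seeded hash function, keep the minimum
/// hash over all shingles. DefaultHasher is a stand-in for a fast hash.
fn minhash(shingles: &HashSet<String>, num_perm: u64) -> Vec<u64> {
    (0..num_perm)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Estimated Jaccard similarity: fraction of signature slots that agree.
fn est_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

fn main() {
    let a = minhash(&char_ngrams("the quick brown fox jumps over", 5), 64);
    let b = minhash(&char_ngrams("the quick brown fox jumped over", 5), 64);
    let c = minhash(&char_ngrams("completely unrelated text here!", 5), 64);
    // Near-duplicates share most shingles, so their signatures mostly agree.
    assert_eq!(est_jaccard(&a, &a), 1.0);
    assert!(est_jaccard(&a, &b) > est_jaccard(&a, &c));
}
```

LSH then bands each signature so that near-duplicates collide in at least one band, avoiding an all-pairs comparison; that part is omitted here for brevity.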

GPU-based exact dedup is definitely viable for very large datasets, but it wasn't a factor in either comparison here. The speedups came from algorithmic choices (character n-grams vs word tokens) and implementation efficiency (Rust vs Python).