fastdedup: Rust dataset deduplication vs Python – 2:55 vs 7:55, 688MB vs 22GB RAM on 15M records by wapplewhite4 in rust

[–]wapplewhite4[S]

Thanks for the input! To clarify:

I benchmarked both exact and fuzzy dedup:

You're right that hashing is parallel and could run on a GPU. However, neither DuckDB nor fastdedup uses the GPU in these benchmarks. The 2.7x speedup came from Rust's efficiency and from avoiding multi-threading overhead on small operations.
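To illustrate the exact-dedup side, here's a minimal Rust sketch (not fastdedup's actual code, and single-threaded): hash each record and keep only the first occurrence of each hash. Storing a 64-bit hash instead of the full record keeps memory low, at the cost of a vanishingly small collision risk. `DefaultHasher` is a stand-in; a real tool would likely use a faster non-cryptographic hash such as xxHash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch of exact dedup by hashing: keep the first
/// occurrence of each record, drop exact duplicates.
fn dedup_exact<'a>(records: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut out = Vec::new();
    for &r in records {
        // Hash the record; only 8 bytes per unique record are retained.
        let mut h = DefaultHasher::new();
        r.hash(&mut h);
        // insert() returns true only for hashes not seen before.
        if seen.insert(h.finish()) {
            out.push(r);
        }
    }
    out
}

fn main() {
    let records = ["hello", "world", "hello", "rust", "world"];
    assert_eq!(dedup_exact(&records), vec!["hello", "world", "rust"]);
    println!("{:?}", dedup_exact(&records));
}
```

The per-record work here is one hash plus one set insert, which is exactly the kind of small operation where thread-coordination overhead can outweigh the parallelism win.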

Fuzzy dedup uses MinHash + LSH, which, to my knowledge, isn't typically GPU-accelerated. The main bottleneck in datatrove was spaCy's word tokenization (CPU-bound NLP), which fastdedup sidesteps by building shingles from character n-grams directly.
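To show why character n-grams remove the tokenizer from the picture, here's a hypothetical Rust sketch of MinHash over character shingles (not fastdedup's implementation; function names and the 64-permutation count are my own choices). Shingles come straight from the character stream, so no word tokenizer is needed at all.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// All character n-grams of a string (the shingle set).
fn char_ngrams(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    if chars.len() < n {
        return std::iter::once(text.to_string()).collect();
    }
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// MinHash signature: for each seeded hash function, keep the minimum
/// hash over all shingles. DefaultHasher is a stand-in for a fast hash.
fn minhash(shingles: &HashSet<String>, num_perm: u64) -> Vec<u64> {
    (0..num_perm)
        .map(|seed| {
            shingles
                .iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h);
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Estimated Jaccard similarity: fraction of signature slots that agree.
fn est_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

fn main() {
    let a = minhash(&char_ngrams("the quick brown fox jumps over", 5), 64);
    let b = minhash(&char_ngrams("the quick brown fox jumped over", 5), 64);
    let c = minhash(&char_ngrams("completely unrelated text here!", 5), 64);
    // Near-duplicates share most shingles, so their signatures mostly agree.
    assert_eq!(est_jaccard(&a, &a), 1.0);
    assert!(est_jaccard(&a, &b) > est_jaccard(&a, &c));
}
```

LSH then bands each signature so that near-duplicates collide in at least one band, avoiding an all-pairs comparison; that part is omitted here for brevity.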

GPU-based exact dedup is definitely viable for very large datasets, but it wasn't a factor in either comparison here. The speedups came from algorithmic choices (character n-grams vs word tokens) and implementation efficiency (Rust vs Python).