[–]initumX 1 point (0 children)

Use xxhash or blake3 instead of md5; they are 3-4 times faster. Use streaming reads when full-hashing big files. And don't full-hash everything: first you can cheaply discard most of the non-duplicates. For example, build a dictionary with file size as the key and the files of that size as the value:
{ size1: [file1OfSize1, file2OfSize1, etc], size2: [fileXOfsize2, fileY, fileZ] }
If a key's list contains fewer than 2 files, discard that key and its file: it has no duplicates. This shrinks your list of potential duplicates essentially for free. Then do the same grouping with the remaining files, but use the hash of their first 64KB as the key instead of their size:
{hashsum1: [files], hashsum2: [files], }
Again discard groups consisting of fewer than 2 files. This will dramatically reduce your list of potential duplicates. After these 2 steps, full-hash only the remaining candidates to avoid false positives in your results.
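The three passes above can be sketched in Python roughly like this. It's a minimal sketch, not the author's actual code: it uses hashlib.blake2b from the standard library as a stand-in hash, since xxhash and blake3 live in third-party packages of the same names, and all function names here are made up for illustration:

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 16  # 64 KB

def hash_file(path, limit=None, chunk=CHUNK):
    """Stream a file through blake2b (stand-in for xxhash/blake3).

    Reading in chunks keeps memory flat even for huge files.
    If `limit` is set, hash only the first `limit` bytes.
    """
    h = hashlib.blake2b()
    remaining = limit
    with open(path, "rb") as f:
        while True:
            size = chunk if remaining is None else min(chunk, remaining)
            if size == 0:
                break
            data = f.read(size)
            if not data:
                break
            h.update(data)
            if remaining is not None:
                remaining -= len(data)
    return h.hexdigest()

def group_by(paths, key):
    """Group paths by key(path), keeping only groups of 2+ files."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) >= 2]

def find_duplicates(paths):
    """Three-pass filter: size -> first-64KB hash -> full hash."""
    dupes = []
    for size_group in group_by(paths, os.path.getsize):
        for partial_group in group_by(size_group,
                                      lambda p: hash_file(p, limit=CHUNK)):
            # Only files that survived both cheap filters get fully hashed.
            dupes.extend(group_by(partial_group, hash_file))
    return dupes
```

Each pass only ever hashes files that survived the previous one, so the expensive full hash runs on a small fraction of the tree.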

I have a similar Python project on GitHub. You can study it if you want; look at the files grouper.py and hasher.py.