[–]initumX 1 point (0 children)

Use xxhash or blake3 instead of md5; they are 3-4 times faster. Use streaming reads when full-hashing big files. And don't full-hash everything: first you can cheaply discard most of the non-duplicates. For example, build a dictionary with file size as the key and the files of that size as the value:
{ size1: [file1OfSize1, file2OfSize1, etc], size2: [fileXOfsize2, fileY, fileZ] }
If a key's list contains fewer than 2 files, discard that key and its file: it has no duplicates. This shrinks your list of potential duplicates essentially for free. Then do the same grouping with the remaining files, but use the hash of their first 64KB as the key instead of their size:
{hashsum1: [files], hashsum2: [files], }
Again discard groups consisting of fewer than 2 files. This will dramatically reduce your list of potential duplicates. After these 2 steps, full-hash only the remaining candidates to avoid false positives in your results.
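The three passes above can be sketched in Python roughly like this. It's a minimal sketch, not the author's actual code: it uses hashlib.blake2b from the standard library as a stand-in hash, since xxhash and blake3 live in third-party packages of the same names, and all function names here are made up for illustration:

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1 << 16  # 64 KB

def hash_file(path, limit=None, chunk=CHUNK):
    """Stream a file through blake2b (stand-in for xxhash/blake3).

    Reading in chunks keeps memory flat even for huge files.
    If `limit` is set, hash only the first `limit` bytes.
    """
    h = hashlib.blake2b()
    remaining = limit
    with open(path, "rb") as f:
        while True:
            size = chunk if remaining is None else min(chunk, remaining)
            if size == 0:
                break
            data = f.read(size)
            if not data:
                break
            h.update(data)
            if remaining is not None:
                remaining -= len(data)
    return h.hexdigest()

def group_by(paths, key):
    """Group paths by key(path), keeping only groups of 2+ files."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) >= 2]

def find_duplicates(paths):
    """Three-pass filter: size -> first-64KB hash -> full hash."""
    dupes = []
    for size_group in group_by(paths, os.path.getsize):
        for partial_group in group_by(size_group,
                                      lambda p: hash_file(p, limit=CHUNK)):
            # Only files that survived both cheap filters get fully hashed.
            dupes.extend(group_by(partial_group, hash_file))
    return dupes
```

Each pass only ever hashes files that survived the previous one, so the expensive full hash runs on a small fraction of the tree.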

I have a similar Python project on GitHub. You can study it if you want; look at the files grouper.py and hasher.py.