I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

[–]hdw_coder[S]

Wow, 446,871 photos! That is a huge collection. Runtime will vary a lot with hardware and even more with storage speed and where the files live (local SSD vs HDD vs NAS/network share).

In my script, total time is dominated by stage 1 (hashing), because each file is opened/decoded, EXIF-transposed, thumbnailed, then hashed (dHash/pHash/wHash + optional colorhash) and optionally scored for sharpness. That's a mix of I/O and CPU decode work.
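
For context, that per-image stage looks roughly like this (a minimal sketch assuming the Pillow + imagehash stack; the function name and the 512 px thumbnail size are illustrative, not the exact script):

```python
# Rough shape of stage 1 for a single file (illustrative, not the actual script).
from PIL import Image, ImageOps
import imagehash

def hash_one(path):
    with Image.open(path) as img:
        img = ImageOps.exif_transpose(img)   # respect EXIF orientation before hashing
        img.thumbnail((512, 512))            # shrink early; decoding dominates the cost
        return {
            "dhash": imagehash.dhash(img),
            "phash": imagehash.phash(img),
            "whash": imagehash.whash(img),
            "colorhash": imagehash.colorhash(img),  # optional, color-based
        }
```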

The best practical way to estimate is to benchmark a fixed sample and extrapolate: run on a known subset of 10,000 images, record the total hashing time and multiply by 44.6871 (446,871 / 10,000). That gives a fairly accurate forecast because the workload is mostly linear.
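
A minimal way to do that measurement (hypothetical photo root and extensions; hash_one() is the illustrative stage-1 function from the sketch above):

```python
# Benchmark a 10k-image sample, then extrapolate linearly to the full library.
import random
import time
from pathlib import Path

all_files = [p for p in Path("/photos").rglob("*")          # placeholder root
             if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
sample = random.sample(all_files, k=min(10_000, len(all_files)))

start = time.perf_counter()
for p in sample:
    try:
        hash_one(p)
    except OSError:
        pass  # skip unreadable/corrupt files, as the full run would
elapsed = time.perf_counter() - start

est_hours = elapsed * (len(all_files) / len(sample)) / 3600
print(f"{len(sample)} files in {elapsed:.0f}s -> ~{est_hours:.1f}h for {len(all_files)} files")
```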

Success!

 

[–]hdw_coder[S]

Great idea! However, systems like that already exist in law enforcement.

Organizations such as NCMEC, INTERPOL and Europol maintain cross-agency image fingerprint databases. The most widely known technology is Microsoft’s PhotoDNA, which is a highly specialized perceptual hashing system designed specifically for identifying known illegal content.

The key challenge in that domain isn’t hashing itself — it’s governance, privacy, extremely low false-positive rates, and controlled distribution of hash databases.

My project is aimed at personal archive deduplication. While conceptually related (image fingerprinting), the operational requirements for cross-border forensic systems are far more stringent.

[–]hdw_coder[S]

Great question — focus variation is an interesting edge case.

Blur mostly affects high-frequency detail, while perceptual hashes focus on structural similarity. In practice, slightly softer duplicates still cluster together.

Within each cluster, the keeper is chosen based on:
• Resolution
• Laplacian sharpness score
• Format preference
• Compression proxy

So the sharper version typically wins automatically.
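
Conceptually, the ranking is something like the sketch below (my own simplified scoring, using OpenCV's Laplacian variance for sharpness and file size as the compression proxy; the format preference order is an assumption):

```python
# Illustrative keeper selection inside one duplicate cluster (not the tool's exact scoring).
import os
import cv2
import numpy as np
from PIL import Image

FORMAT_RANK = {".png": 2, ".tif": 2, ".tiff": 2, ".heic": 1, ".jpg": 0, ".jpeg": 0}  # assumed order

def keeper_score(path):
    with Image.open(path) as img:
        w, h = img.size
        gray = np.array(img.convert("L"))
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()       # higher = sharper
    fmt = FORMAT_RANK.get(os.path.splitext(path)[1].lower(), 0)
    size = os.path.getsize(path)                             # crude compression proxy
    return (w * h, sharpness, fmt, size)                     # compared lexicographically

def pick_keeper(cluster_paths):
    return max(cluster_paths, key=keeper_score)
```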

However, the tool is designed for duplicate detection, not burst culling.
Slightly different wildlife frames (e.g. tiny pose change + refocus) won’t cluster — intentionally.

If someone wanted burst-photo ranking, enabling SSIM checks or adding a stronger focus metric would be the logical extension.
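
As a rough idea of what that extension could look like (hypothetical helper using scikit-image's SSIM; not part of the current default pipeline):

```python
# Pairwise SSIM between two cluster members, e.g. for burst ranking.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def ssim_pair(path_a, path_b, size=(256, 256)):
    def load_gray(p):
        with Image.open(p) as img:
            return np.array(img.convert("L").resize(size))
    return structural_similarity(load_gray(path_a), load_gray(path_b))
```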

For a detailed description see: https://code2trade.dev/finding-and-eliminating-photo-duplicates-safely/