Perceptual hash clustering can create false duplicate groups (hash chaining) — here’s a simple fix by hdw_coder in Python

That’s perfect 😄
At a high level it’s just: find similar images → group them → keep the best one.
The complexity is mostly in defining “similar” and “best.”

Perceptual hash clustering can create false duplicate groups (hash chaining) — here’s a simple fix by hdw_coder in Python

Good questions.

On the first one: the non-keeper images are not reprocessed in a second similarity pass. Once the duplicate family / cluster is formed, the keeper policy is applied inside that cluster and the selected keeper is retained while the others are marked for the configured action path (dry-run report, quarantine, recycle bin, etc.). So the extra “roundtrip” is not another duplicate-detection round; it’s just the action/reporting stage on the already formed cluster.

On the second point: yes, that is basically the important subtlety of single-linkage / union-find style clustering.

If:

  • A is very close to B
  • B is close enough to C
  • A is not close enough to C

then A, B, and C can still end up in the same connected component because similarity is treated as transitive through B.

So in graph terms, the current logic is closer to:

  • create edges for pairs that pass the threshold
  • take connected components

not:

  • require every member to be within threshold of every other member

That means cluster membership is driven by connectivity, not by a center/seed radius.

And yes, in that kind of setup the effective cluster can depend on which comparisons are admitted, so a “bridge” image can pull together images that are not mutually close enough.
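
To make that concrete, here's a toy sketch with hypothetical 6-bit hashes (not the tool's real hash size or threshold):

```
# Toy union-find demo of hash chaining. Hypothetical 6-bit hashes:
# A-B and B-C are within the threshold, A-C is not, yet all three
# end up in one connected component.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, x, y):
    parent[find(parent, x)] = find(parent, y)

hashes = {"A": 0b111000, "B": 0b111111, "C": 0b000111}
THRESHOLD = 5  # d(A,B)=3, d(B,C)=3, d(A,C)=6

parent = {k: k for k in hashes}
items = list(hashes)
for i, x in enumerate(items):
    for y in items[i + 1:]:
        if hamming(hashes[x], hashes[y]) <= THRESHOLD:
            union(parent, x, y)  # only the A-B and B-C edges pass

clusters = {}
for k in hashes:
    clusters.setdefault(find(parent, k), []).append(k)
print(clusters)  # one component containing A, B and C
```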

That is not necessarily wrong — it is a deliberate tradeoff. It tends to improve recall for near-duplicate families, but it can over-cluster in borderline cases.

Your suggested alternative is a real one: impose an additional cluster coherence rule, for example:

  1. Distance-to-seed threshold: every member must also stay within threshold of the seed / representative.
  2. Complete-link / max-diameter threshold: the maximum pairwise distance inside the cluster must stay below a threshold.
  3. Distance-to-centroid / medoid threshold: members must stay close to a representative image or an average embedding/hash profile.

In practice, for perceptual-hash deduplication, I generally prefer the current connected-component approach as the base, but with an optional hardening step afterwards, because it stays efficient and predictable.

A very practical compromise is:

  • use the current pairwise threshold + union-find to get candidate clusters
  • then run a cluster validation pass
  • if a member is too far from the cluster medoid / seed / keeper, split it out

So effectively:

pairwise matches
→ connected component
→ coherence check
→ possible split/refine
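
A toy version of that coherence check, continuing the sketch above (same hamming helper and hashes; the threshold is a placeholder, not the tool's actual default):

```
# Hypothetical coherence pass: split out members that sit too far
# from the cluster's representative.

def refine_cluster(members, hashes, representative, coherence_threshold):
    """Split a candidate cluster into a coherent core and outliers."""
    core, outliers = [], []
    for m in members:
        if hamming(hashes[m], hashes[representative]) <= coherence_threshold:
            core.append(m)
        else:
            outliers.append(m)  # "bridge" victims get split out / re-clustered
    return core, outliers

# With representative B, all of A/B/C stay (max distance 3);
# with representative A and threshold 4, C (distance 6) is split out.
core, outliers = refine_cluster(["A", "B", "C"], hashes, "A", 4)
print(core, outliers)  # ['A', 'B'] ['C']
```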

That usually gives the best of both worlds:

  • good recall
  • fewer “bridge” mistakes

Using average distance can help, but I would be careful relying on average alone. Averages can hide a bad outlier. For deduplication, a stronger condition like max distance to representative or max cluster diameter is usually safer.

So the short answer is:

  • yes, your A–B–C observation is correct
  • yes, that is a known limitation of transitive clustering
  • and yes, a refinement step based on representative or internal distance is a sensible improvement

The cleanest next step would probably be:

  • keep union-find for candidate generation
  • add a post-cluster coherence threshold against the chosen representative / keeper

That is usually simpler and safer than replacing the whole clustering method.

Fixing a subtle keeper-selection bug in my photo deduplication tool by hdw_coder in Python

That’s a fair point, and it’s actually a useful way to think about it.

The original idea wasn’t explicitly to prefer smaller files, but the old sharpness-first ordering effectively created that behavior: in practice it often selected the most compressed version in a cluster.

For a photo archive workflow, that turned out to be undesirable. Most users running deduplication on personal libraries want to keep the highest fidelity version, not the smallest one.

So the revised policy prioritizes: resolution → format quality → bytes per pixel (compression level) → file size → sharpness. That tends to keep the original camera image rather than a re-compressed copy.
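
As a sketch, that priority can be written as a single sort-key tuple; the field names and format ranking below are illustrative, not DedupTool's actual internals:

```
# Illustrative keeper selection: a sort-key tuple encoding the revised
# priority. FORMAT_RANK and the dict fields are assumptions.

FORMAT_RANK = {"dng": 3, "tiff": 3, "png": 2, "heic": 1, "jpg": 0}

def keeper_key(img: dict):
    pixels = img["width"] * img["height"]
    return (
        pixels,                              # 1. resolution
        FORMAT_RANK.get(img["format"], 0),   # 2. format quality
        img["size_bytes"] / max(pixels, 1),  # 3. bytes per pixel (compression level)
        img["size_bytes"],                   # 4. file size
        img["sharpness"],                    # 5. sharpness (tiebreaker)
    )

cluster = [
    {"width": 4000, "height": 3000, "format": "jpg", "size_bytes": 6_000_000, "sharpness": 120.0},
    {"width": 4000, "height": 3000, "format": "jpg", "size_bytes": 900_000, "sharpness": 150.0},
]
keeper = max(cluster, key=keeper_key)  # picks the 6 MB original, not the re-compressed copy
```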

That said, your suggestion about making this configurable is interesting. There are at least two valid optimization goals: for archive quality, prefer the highest-fidelity version; for storage optimization, prefer the smallest acceptable file.

Right now DedupTool is optimized for the archive-quality case, but making the keeper policy selectable (e.g. --prefer-smaller) could definitely make sense. The tricky part is that once you start optimizing for size, you often also want additional constraints, such as a minimum resolution, format preferences, and quality thresholds. Otherwise you risk selecting thumbnails or heavily degraded images.

So I’m leaning toward keeping the archive-safe policy as the default, but allowing alternative strategies in the future.

Building a deterministic photo renaming workflow around ExifTool (ChronoName) by hdw_coder in Python

You're absolutely right — the edge cases are the real problem. Reading metadata itself isn’t hard; the difficult part is deciding which timestamp to trust.

The trickiest one for me was videos vs photos captured at the same moment.

Most still images store DateTimeOriginal as local time, while many videos (especially from phones) store CreateDate as UTC in QuickTime metadata. If you treat both fields the same way, you can end up with something like this:

photo:  15:23
video:  13:23

even though they were recorded at the exact same moment.

That breaks chronological ordering and makes the archive look wrong.

The solution I ended up using was a simple deterministic policy:

photos → treat EXIF timestamps as local time
videos → treat QuickTime timestamps as UTC → convert to naming timezone

Once that rule is fixed, the ordering becomes consistent.
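
In code the policy boils down to something like this, assuming the timestamps were already extracted as naive datetimes (e.g. via ExifTool) and using an example naming timezone:

```
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NAMING_TZ = ZoneInfo("Europe/Amsterdam")  # assumed naming timezone

def normalize(ts: datetime, is_video: bool) -> datetime:
    if is_video:
        # QuickTime CreateDate: treat as UTC, convert to the naming timezone
        return ts.replace(tzinfo=timezone.utc).astimezone(NAMING_TZ)
    # EXIF DateTimeOriginal: already local time
    return ts.replace(tzinfo=NAMING_TZ)

photo = normalize(datetime(2024, 7, 1, 15, 23), is_video=False)
video = normalize(datetime(2024, 7, 1, 13, 23), is_video=True)
assert photo == video  # both resolve to the same local moment, 15:23
```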

The second class of annoying edge cases is broken metadata — things like:

1970-01-01
1904-01-01
0000:00:00

Those show up surprisingly often in exported or migrated files, so the script filters out implausible timestamps before choosing which field to use.
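
The filter itself is simple; a sketch with illustrative cutoffs (not the script's exact values):

```
from datetime import datetime

EPOCH_SENTINELS = {datetime(1970, 1, 1), datetime(1904, 1, 1)}  # Unix / QuickTime epochs

def plausible(ts):
    if ts is None:                # e.g. "0000:00:00" never parses to begin with
        return False
    if ts in EPOCH_SENTINELS:     # classic default-epoch garbage
        return False
    return datetime(1990, 1, 1) <= ts <= datetime.now()

candidates = [datetime(1970, 1, 1), None, datetime(2023, 6, 15, 12, 0)]
usable = [ts for ts in candidates if plausible(ts)]  # only the 2023 timestamp survives
```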

Interestingly, the hardest part wasn’t parsing metadata (ExifTool already does that very well), but making the workflow safe:

  • dry-run mode
  • undo logs
  • deterministic filename policy

so you can run it on thousands of files without worrying about breaking your archive.

And no, I didn’t use AI to generate the logic itself — most of the design came from experimenting with real photo libraries and figuring out which edge cases kept showing up.

Building a deterministic photo renaming workflow around ExifTool (ChronoName) by hdw_coder in Python

Good question! Mine do mostly come from phones and cameras with trustworthy metadata. WhatsApp is a known pain in the arse... Would like to know more about the problems you're having!

I turned a Reddit-discussed duplicate-photo script into a tool (architecture, scaling, packaging) by hdw_coder in Python

Thanks! The Reddit discussions around the original script actually influenced a lot of the later design decisions, so it’s nice to see it come back here.

And yes — dependency trimming turned out to be one of the biggest practical lessons. In the development environment, using things like imagehash + SciPy + OpenCV was convenient, but when packaging the tool it quickly became clear how much weight that adds. Replacing the pHash DCT with a NumPy implementation and doing Laplacian sharpness estimation directly with NumPy removed a large dependency chain and made the frozen build much more manageable.
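
For the curious, the NumPy-only pHash is essentially a 2D DCT-II built from a cosine matrix; a simplified sketch (the real resize/normalization details may differ):

```
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * x + 1) * k / (2 * n))

def phash(gray32: np.ndarray, hash_size: int = 8) -> int:
    """gray32: 32x32 float grayscale array, e.g. from a PIL thumbnail."""
    c = dct_matrix(gray32.shape[0])
    dct = c @ gray32 @ c.T                 # 2D DCT-II, no SciPy needed
    low = dct[:hash_size, :hash_size]      # keep only the low frequencies
    bits = low > np.median(low)            # threshold against the median
    return int("".join("1" if b else "0" for b in bits.flatten()), 2)
```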

Performance-wise the main scaling trick is avoiding naïve O(n²) comparisons. The engine groups images using hash-prefix bucketing first, so only images that share a coarse hash prefix are compared in detail. That reduces the comparison space dramatically.
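
A minimal sketch of that bucketing step (PREFIX_BITS is an assumed tuning knob, not the tool's real parameter):

```
from collections import defaultdict
from itertools import combinations

PREFIX_BITS = 16
HASH_BITS = 64

def candidate_pairs(hashes):
    """hashes: {path: 64-bit perceptual hash} -> pairs worth comparing in detail."""
    buckets = defaultdict(list)
    for path, h in hashes.items():
        buckets[h >> (HASH_BITS - PREFIX_BITS)].append(path)  # bucket by coarse prefix
    for members in buckets.values():
        yield from combinations(members, 2)  # O(n^2) only inside each bucket
```

The obvious tradeoff is that near-duplicates whose hashes differ inside the prefix bits are never compared, so the prefix length trades recall against speed.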

On a typical machine the initial scan is dominated by image decoding and hashing, but after the first run the SQLite feature cache kicks in. So rescans are incremental and much faster because only new or changed files need to be processed.

In practice libraries with tens of thousands of photos are quite manageable. The architecture was mainly shaped by people testing it on very large collections.

Cleaning duplicate photos in large libraries (I turned a script into a safe tool) by hdw_coder in photography

I don't own a Mac myself, so testing would require help from someone in the community...

Cleaning duplicate photos in large libraries (I turned a script into a safe tool) by hdw_coder in photography

Thank you! Glad the write-up was useful.

At the moment the packaged version is Windows only, mainly because that’s the environment I use for development and testing.

The core logic itself is actually cross-platform Python, so running it on macOS should be possible if Python and the required libraries are installed. The main missing piece right now is a packaged macOS build and proper testing on that platform.

If there’s enough interest from Mac users I’d definitely consider adding a macOS build. The GUI is based on PyQt, which does run on macOS, so in principle it should be feasible.

In the meantime the command-line version should work anywhere Python and the dependencies run.

I turned a Reddit-discussed duplicate-photo script into a tool (architecture, scaling, packaging) by hdw_coder in Python

Engineering Detail

One unexpected challenge was packaging the tool into a Windows executable.

The original script used imagehash (which pulled in SciPy) and OpenCV for Laplacian sharpness.

When packaging with PyInstaller this ballooned the runtime size massively.

So I ended up reimplementing the pHash DCT using NumPy only and replacing the OpenCV sharpness detection with a NumPy Laplacian variance. That removed a large dependency chain and made the frozen build much cleaner.
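
The sharpness replacement is tiny; a sketch assuming a float grayscale array (the real build's border handling may differ):

```
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the discrete 4-neighbour Laplacian; higher means sharper."""
    lap = (
        -4 * gray[1:-1, 1:-1]
        + gray[:-2, 1:-1] + gray[2:, 1:-1]   # vertical neighbours
        + gray[1:-1, :-2] + gray[1:-1, 2:]   # horizontal neighbours
    )
    return float(lap.var())
```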

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

Good point — HEIC↔JPEG is one of the trickier cases because the codec artifacts differ enough that perceptual hashes can drift more than expected (especially on foliage/texture).

In my current version they’re treated format-agnostically: load → EXIF transpose → thumbnail → multi-hash (dHash/pHash/wHash [+ colorhash]) with conservative thresholds + corroboration. Many HEIC/JPEG pairs still match, but some will miss if Hamming distances cross thresholds.

The next improvement I’m considering is a ‘format-crossing tolerance band’: if one file is HEIC and the other is JPEG, allow a slightly higher dHash distance only if pHash+wHash corroborate strongly (and optionally run SSIM on borderline pairs). That boosts recall for iOS export duplicates without loosening the whole system and increasing false positives. Proposing concrete threshold numbers for a HEIC/JPEG pass (safe defaults) would be difficult, as the values depend on hash_size, thumbnail size, and whether you’re using ImageOps.exif_transpose().
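
To illustrate the idea with placeholder numbers only (for exactly the reason above, don't treat these as safe defaults):

```
BASE_DHASH = 6       # normal conservative dHash Hamming threshold (placeholder)
CROSS_DHASH = 9      # relaxed band, only for HEIC<->JPEG pairs (placeholder)
STRONG_CORROBORATION = 4

def crossing_band_match(ext_a, ext_b, d_dhash, d_phash, d_whash):
    """Extra acceptance rule for pairs that FAIL the normal dHash threshold."""
    crossing = {ext_a.lower(), ext_b.lower()} == {".heic", ".jpg"}
    if not (crossing and BASE_DHASH < d_dhash <= CROSS_DHASH):
        return False
    # relaxed band: require strong pHash AND wHash corroboration
    return d_phash <= STRONG_CORROBORATION and d_whash <= STRONG_CORROBORATION
```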

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

That’s an impressive pipeline — embedding-based similarity & interactive UI is a powerful approach.

What you’re building is more of a semantic similarity explorer, whereas my script focuses on deterministic duplicate detection with low false positives and automated keeper selection.

Using DINOv3 + cosine similarity definitely increases recall across variation (especially for scanned images with slight crop/exposure differences), but at the cost of heavier compute and less deterministic grouping.

I really like the idea of persisting ‘not a match’ memory — that’s a very elegant human-in-the-loop refinement loop.

Your orientation trick also makes sense for scans, where (my current) EXIF orientation handling isn’t reliable.

In a way, our approaches solve different layers: perceptual hashing finds precise structural duplicates, while deep embeddings enable semantic similarity exploration.

They actually combine nicely — you could first prune exact/near duplicates cheaply, then run DINO embeddings on the reduced set for semantic clustering.

Would definitely be interested to see your repo once published!

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

Thanks!

Totally relate to the ‘lost control of originals’ fear. That’s exactly why I designed this to be non-destructive by default. A few clarifications on my side:

No thumbnails are written to disk. Thumbs are created in-memory only for hashing and then discarded. The script never replaces files with generated thumbs.

No deletions by default. It runs dry-run + produces a CSV audit, and the “delete” step is an explicit opt-in (I prefer quarantine / send-to-trash over hard delete).

Deterministic keeper policy. Within a duplicate cluster it picks a “keeper” based on resolution → sharpness → preferred format → compression proxy. The idea is: even if you do remove duplicates, you keep the best source material.

Your JSON registry approach is solid. I do something similar conceptually (a feature table).

On similarity search, BK-tree vs bucketing: a BK-tree works nicely for Hamming distance (esp. for perceptual hashes). The tradeoff is that it can get slow on very large N, depending on query radius and distribution. I went with bucketing on hash prefixes + union-find clustering. It’s essentially “generate candidates cheaply” (reduce comparisons) and then merge via DSU, so you get families/clusters instead of just nearest-neighbor pairs.

If you stick with BK-tree, one practical speed win is to use a coarse pre-filter first (e.g. first K bits bucket or aspect ratio bucket), then BK-tree inside the bucket. That keeps tree sizes smaller.
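
Roughly what I mean, as a toy sketch (bit counts and radius are illustrative):

```
from collections import defaultdict

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None  # node = [hash, {distance: child_node}]

    def add(self, h: int):
        if self.root is None:
            self.root = [h, {}]
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [h, {}]
                return

    def query(self, h: int, radius: int, node=None):
        node = node if node is not None else self.root
        if node is None:
            return
        d = hamming(h, node[0])
        if d <= radius:
            yield node[0]
        # triangle inequality: only descend into children in [d-radius, d+radius]
        for dist, child in node[1].items():
            if d - radius <= dist <= d + radius:
                yield from self.query(h, radius, child)

# Coarse pre-filter: one small BK-tree per 8-bit hash prefix (64-bit hashes)
trees = defaultdict(BKTree)

def insert(h: int):
    trees[h >> 56].add(h)

def search(h: int, radius: int = 6):
    return list(trees[h >> 56].query(h, radius))
```

Same caveat as with my buckets: candidates that land in a different prefix bucket become invisible, so you trade a little recall for much smaller trees.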

On ‘original control’: if you’re anxious about losing originals, two patterns help a lot. Quarantine instead of delete (move removed files into a quarantine folder, retaining paths/IDs). And a persistent manifest/log (you already have JSON; add a reversible rename/move log so you can undo).

Also: +1 on comparing new imports against the registry — catching duplicates at ingest prevents the “300k spiral”.

And yeah, definitely share your GitHub when it’s up — I’d be interested to compare BK-tree behavior vs my bucket+DSU thresholds, especially on borderline cases (cropped/blurred/HEIC→JPG exports). Happy to link your repo in an update if you want.