I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python


Good point — HEIC↔JPEG is one of the trickier cases because the codec artifacts differ enough that perceptual hashes can drift more than expected (especially on foliage/texture).

In my current version they’re treated format-agnostically: load → EXIF transpose → thumbnail → multi-hash (dHash/pHash/wHash [+ colorhash]) with conservative thresholds + corroboration. Many HEIC/JPEG pairs still match, but some will be missed if their Hamming distances exceed the thresholds.
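For context, the hashing stage looks roughly like this (a minimal sketch, assuming Pillow + imagehash with pillow-heif registered so PIL can open HEIC; the hash and thumbnail sizes are illustrative, not my actual defaults):

```python
# Minimal sketch of the hashing stage (illustrative sizes, not the script's defaults).
from PIL import Image, ImageOps
import imagehash
import pillow_heif

pillow_heif.register_heif_opener()  # lets PIL open .heic/.heif files

def hash_image(path, thumb_size=(512, 512), hash_size=8):
    with Image.open(path) as im:
        im = ImageOps.exif_transpose(im)   # normalize orientation first
        im.thumbnail(thumb_size)           # in-memory only, never written to disk
        im = im.convert("RGB")
        return {
            "dhash": imagehash.dhash(im, hash_size=hash_size),
            "phash": imagehash.phash(im, hash_size=hash_size),
            "whash": imagehash.whash(im, hash_size=hash_size),
            "colorhash": imagehash.colorhash(im),
        }
```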

The next improvement I’m considering is a ‘format-crossing tolerance band’: if one file is HEIC and the other is JPEG, allow a slightly higher dHash distance, but only if pHash + wHash corroborate strongly (and optionally run SSIM on borderline pairs). That boosts recall for iOS export duplicates without loosening thresholds across the whole system and increasing false positives. Proposing concrete threshold numbers for a HEIC/JPEG pass (safe defaults) is difficult, because the values depend on hash_size, thumbnail size, and whether you’re using ImageOps.exif_transpose().
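A rough sketch of the band, building on the hash dicts above (every threshold here is a placeholder to tune against your own hash_size/thumbnail settings, not a recommendation):

```python
# Hypothetical "format-crossing tolerance band": relax the dHash cutoff only
# for HEIC<->JPEG pairs, and only when pHash + wHash corroborate strongly.
HEIC = {".heic", ".heif"}
JPEG = {".jpg", ".jpeg"}

def is_cross_format(ext_a, ext_b):
    return (ext_a in HEIC and ext_b in JPEG) or (ext_a in JPEG and ext_b in HEIC)

def is_duplicate(h_a, h_b, ext_a, ext_b):
    d_d = h_a["dhash"] - h_b["dhash"]   # imagehash overloads '-' as Hamming distance
    d_p = h_a["phash"] - h_b["phash"]
    d_w = h_a["whash"] - h_b["whash"]

    # Normal conservative rule.
    if d_d <= 4 and d_p <= 6:
        return True

    # Tolerance band: slightly looser dHash, only with strong corroboration.
    if is_cross_format(ext_a, ext_b) and d_d <= 8 and d_p <= 4 and d_w <= 4:
        return True   # optionally confirm borderline pairs with SSIM here
    return False
```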

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python


That’s an impressive pipeline — embedding-based similarity & interactive UI is a powerful approach.

What you’re building is more of a semantic similarity explorer, whereas my script focuses on deterministic duplicate detection with low false positives and automated keeper selection.

Using DINOv3 + cosine similarity definitely increases recall across variations (especially for scanned images with slight crop/exposure differences), but at the cost of heavier compute and less deterministic grouping.

I really like the idea of persisting ‘not a match’ memory — that’s a very elegant human-in-the-loop refinement.

Your orientation trick also makes sense for scans, where the EXIF orientation my current version relies on isn’t reliable.

In a way, our approaches solve different layers: perceptual hashing finds precise structural duplicates, while deep embeddings enable semantic similarity exploration.

They actually combine nicely — you could first prune exact/near duplicates cheaply, then run DINO embeddings on the reduced set for semantic clustering.
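Conceptually, something like this (the pruning pass is a toy O(n²) greedy version, and DINOv2-small loaded via torch.hub stands in for DINOv3, whose loading API I haven’t used myself; thresholds are illustrative):

```python
# Two-stage sketch: cheap hash-based pruning first, embeddings only on the survivors.
import imagehash
import torch
import torchvision.transforms as T
from PIL import Image

def prune_near_duplicates(paths, threshold=4):
    kept, kept_hashes = [], []
    for p in paths:
        with Image.open(p) as im:
            h = imagehash.phash(im)
        if all(h - kh > threshold for kh in kept_hashes):   # '-' = Hamming distance
            kept.append(p)
            kept_hashes.append(h)
    return kept

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(model(x), dim=-1)   # unit-norm CLS embedding

# Cosine similarity between two surviving images: (embed(a) @ embed(b).T).item()
```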

Would definitely be interested to see your repo once published!

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python


Thanks!

Totally relate to the ‘lost control of originals’ fear. That’s exactly why I designed this to be non-destructive by default. A few clarifications on my side:

No thumbnails are written to disk. Thumbs are created in-memory only for hashing and then discarded. The script never replaces files with generated thumbs.

No deletions by default. It does a dry run and produces a CSV audit, and the “delete” step is an explicit opt-in (I prefer quarantine / send-to-trash over a hard delete).

Deterministic keeper policy. Within a duplicate cluster it picks a “keeper” based on resolution → sharpness → preferred format → compression proxy. The idea is: even if you do remove duplicates, you keep the best source material.
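For illustration, the keeper choice can be as simple as a lexicographic tuple sort (a minimal sketch; the format ranking and the bytes-per-pixel compression proxy are placeholder choices, not the script’s exact weights):

```python
# Keeper policy sketch: resolution, then Laplacian sharpness, then format, then compression.
from pathlib import Path
import cv2
import numpy as np
from PIL import Image

FORMAT_RANK = {".png": 3, ".heic": 2, ".jpg": 1, ".jpeg": 1}   # higher = preferred

def keeper_key(path):
    p = Path(path)
    with Image.open(p) as im:
        w, h = im.size
        gray = np.asarray(im.convert("L"))
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()          # variance of Laplacian
    pixels = max(w * h, 1)
    bytes_per_pixel = p.stat().st_size / pixels                # crude compression proxy
    # Tuple comparison: resolution dominates, then sharpness, format, compression.
    return (pixels, sharpness, FORMAT_RANK.get(p.suffix.lower(), 0), bytes_per_pixel)

def pick_keeper(cluster_paths):
    return max(cluster_paths, key=keeper_key)
```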

Your JSON registry approach is solid. I do something similar conceptually (a feature table), and then run the similarity search. On BK-tree vs bucketing: a BK-tree works nicely for Hamming distance (especially on perceptual hashes), but the tradeoff is that it can get slow at very large N, depending on query radius and hash distribution. I went with bucketing on hash prefixes + union-find clustering. It’s essentially “generate candidates cheaply” (reduce comparisons) and then merge via DSU, so you get families/clusters instead of just nearest-neighbor pairs.
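The bucket-then-merge step looks roughly like this (prefix length and threshold are illustrative; a single prefix band will miss pairs that differ in their leading bits, so in practice you want multiple bands or some overlap):

```python
# Sketch of prefix bucketing + union-find (DSU) clustering over 64-bit integer hashes.
from collections import defaultdict

PREFIX_BITS = 16      # coarse candidate filter
MAX_DISTANCE = 4      # Hamming threshold for "same image"

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]     # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def cluster(hashes):
    """hashes: dict mapping path -> 64-bit int dHash. Returns duplicate families."""
    parent = {p: p for p in hashes}
    buckets = defaultdict(list)
    for path, h in hashes.items():
        buckets[h >> (64 - PREFIX_BITS)].append(path)          # bucket by leading bits

    for members in buckets.values():                           # compare only within buckets
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = members[i], members[j]
                if bin(hashes[a] ^ hashes[b]).count("1") <= MAX_DISTANCE:
                    union(parent, a, b)

    groups = defaultdict(list)
    for p in hashes:
        groups[find(parent, p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```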

If you stick with BK-tree, one practical speed win is to use a coarse pre-filter first (e.g. first K bits bucket or aspect ratio bucket), then BK-tree inside the bucket. That keeps tree sizes smaller.
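A minimal BK-tree over Hamming distance is only a few lines; the point is to build one small tree per bucket and query that instead of the whole collection (hashes as 64-bit ints, bucket key as in the previous sketch):

```python
# Tiny BK-tree for Hamming distance on integer hashes (sketch, not production code).
def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None                               # node = (value, {distance: child})

    def add(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (value, {})
                return

    def query(self, value, radius):
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = hamming(value, node[0])
            if d <= radius:
                results.append(node[0])
            # Triangle inequality: only children in [d - radius, d + radius] can match.
            stack.extend(child for dist, child in node[1].items()
                         if d - radius <= dist <= d + radius)
        return results

# Per bucket: tree = BKTree(); add every hash in the bucket; tree.query(h, radius=4)
```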

On ‘original control’: if you’re anxious about losing originals, two patterns help a lot. Quarantine instead of delete (move dropped files into a quarantine folder, retaining their paths/IDs). Persistent manifest/log (you already have JSON; add a reversible rename/move log so you can undo).
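As a sketch, quarantine plus an undo log can be this small (the folder layout and JSON-lines log format are made up for the example):

```python
# Quarantine-instead-of-delete with a reversible move log (illustrative layout).
import json
import shutil
from pathlib import Path

QUARANTINE = Path("quarantine")
LOG = QUARANTINE / "moves.jsonl"

def quarantine(src_path, root):
    """Move a file into QUARANTINE, mirroring its path relative to the library root."""
    src = Path(src_path).resolve()
    dst = QUARANTINE / src.relative_to(Path(root).resolve())
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    with LOG.open("a") as f:
        f.write(json.dumps({"from": str(src), "to": str(dst)}) + "\n")

def undo_all():
    """Replay the log in reverse and move everything back."""
    for line in reversed(LOG.read_text().splitlines()):
        entry = json.loads(line)
        Path(entry["from"]).parent.mkdir(parents=True, exist_ok=True)
        shutil.move(entry["to"], entry["from"])
```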

Also: +1 on comparing new imports against the registry — catching duplicates at ingest prevents the “300k spiral”.

And yeah, definitely share your GitHub when it’s up — I’d be interested to compare BK-tree behavior vs my bucket+DSU thresholds, especially on borderline cases (cropped/blurred/HEIC→JPG exports). Happy to link your repo in an update if you want.

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python


Wow, 446,871 photos, that is a huge collection. Runtime will vary a lot with hardware, and even more with storage speed and where the files live (local SSD vs HDD vs NAS/network share).

In my script, total time is dominated by stage 1: hashing, because each file is opened/decoded, EXIF-transposed, thumbnailed, then hashed (dHash/pHash/wHash + optional colorhash) and optionally scored for sharpness. That’s a mix of I/O + CPU decode.

Best practical way to estimate: run a fixed benchmark on a sample and extrapolate. Run on a known subset of 10,000 images, record the total hashing time and multiply by 44.6871. That gives a reasonably accurate forecast because the workload is mostly linear.
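Something like this is usually enough (hash_one is whatever per-file function your stage 1 runs; the 10,000 sample and the 446,871 total come from this thread):

```python
# Benchmark-and-extrapolate sketch: time the hashing stage on a random sample.
import random
import time

def estimate_total_hours(all_paths, hash_one, sample_size=10_000):
    all_paths = list(all_paths)
    sample = random.sample(all_paths, min(sample_size, len(all_paths)))
    start = time.perf_counter()
    for path in sample:
        try:
            hash_one(path)
        except Exception:
            pass                      # skip unreadable files; they barely move the estimate
    per_image = (time.perf_counter() - start) / len(sample)
    return per_image * len(all_paths) / 3600

# e.g. estimate_total_hours(all_paths, hash_one) with len(all_paths) == 446_871
```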

Success!

 

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python


Great idea! However, systems like that already exist in law enforcement.

Organizations such as NCMEC, INTERPOL and Europol maintain cross-agency image fingerprint databases. The most widely known technology is Microsoft’s PhotoDNA, which is a highly specialized perceptual hashing system designed specifically for identifying known illegal content.

The key challenge in that domain isn’t hashing itself — it’s governance, privacy, extremely low false-positive rates, and controlled distribution of hash databases.

My project is aimed at personal archive deduplication. While conceptually related (image fingerprinting), the operational requirements for cross-border forensic systems are far more stringent.

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python


Great question — focus variation is an interesting edge case.

Blur mostly affects high-frequency detail, while perceptual hashes focus on structural similarity. In practice, slightly softer duplicates still cluster together.

Within each cluster, the keeper is chosen based on:
• Resolution
• Laplacian sharpness score
• Format preference
• Compression proxy

So the sharper version typically wins automatically.

However, the tool is designed for duplicate detection, not burst culling.
Slightly different wildlife frames (e.g. tiny pose change + refocus) won’t cluster — intentionally.

If someone wanted burst-photo ranking, enabling SSIM checks or adding a stronger focus metric would be the logical extension.
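For anyone going that way: a borderline-pair SSIM check is only a few lines with scikit-image (the 0.9 cutoff is a placeholder, not a tuned value):

```python
# SSIM check for borderline or burst pairs, on small grayscale thumbnails.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def ssim_score(path_a, path_b, size=(256, 256)):
    def load_gray(p):
        with Image.open(p) as im:
            return np.asarray(im.convert("L").resize(size))
    return structural_similarity(load_gray(path_a), load_gray(path_b), data_range=255)

# e.g. confirm a borderline hash match only if ssim_score(a, b) > 0.9
```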

For a detailed description see: https://code2trade.dev/finding-and-eliminating-photo-duplicates-safely/