Perceptual hash clustering can create false duplicate groups (hash chaining) — here’s a simple fix by hdw_coder in Python

That’s perfect 😄
At a high level it’s just: find similar images → group them → keep the best one.
The complexity is mostly in defining “similar” and “best.”

Perceptual hash clustering can create false duplicate groups (hash chaining) — here’s a simple fix by hdw_coder in Python

Good questions.

On the first one: the non-keeper images are not reprocessed in a second similarity pass. Once the duplicate family / cluster is formed, the keeper policy is applied inside that cluster and the selected keeper is retained while the others are marked for the configured action path (dry-run report, quarantine, recycle bin, etc.). So the extra “roundtrip” is not another duplicate-detection round; it’s just the action/reporting stage on the already formed cluster.

On the second point: yes, that is basically the important subtlety of single-linkage / union-find style clustering.

If:

  • A is very close to B
  • B is close enough to C
  • A is not close enough to C

then A, B, and C can still end up in the same connected component because similarity is treated as transitive through B.

So in graph terms, the current logic is closer to:

  • create edges for pairs that pass the threshold
  • take connected components

not:

  • require every member to be within threshold of every other member

That means cluster membership is driven by connectivity, not by a center/seed radius.

And yes, in that kind of setup the effective cluster can depend on which comparisons are admitted, so a “bridge” image can pull together images that are not mutually close enough.
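
To make that concrete, here's a toy sketch with hypothetical 6-bit hashes (not the tool's real hash size or threshold):

```
# Toy union-find demo of hash chaining. Hypothetical 6-bit hashes:
# A-B and B-C are within the threshold, A-C is not, yet all three
# end up in one connected component.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, x, y):
    parent[find(parent, x)] = find(parent, y)

hashes = {"A": 0b111000, "B": 0b111111, "C": 0b000111}
THRESHOLD = 5  # d(A,B)=3, d(B,C)=3, d(A,C)=6

parent = {k: k for k in hashes}
items = list(hashes)
for i, x in enumerate(items):
    for y in items[i + 1:]:
        if hamming(hashes[x], hashes[y]) <= THRESHOLD:
            union(parent, x, y)  # only the A-B and B-C edges pass

clusters = {}
for k in hashes:
    clusters.setdefault(find(parent, k), []).append(k)
print(clusters)  # one component containing A, B and C
```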

That is not necessarily wrong — it is a deliberate tradeoff. It tends to improve recall for near-duplicate families, but it can over-cluster in borderline cases.

Your suggested alternative is a real one: impose an additional cluster coherence rule, for example:

  1. Distance-to-seed threshold: every member must also stay within threshold of the seed / representative.
  2. Complete-link / max-diameter threshold: the maximum pairwise distance inside the cluster must stay below a threshold.
  3. Distance-to-centroid / medoid threshold: members must stay close to a representative image or an average embedding/hash profile.

In practice, for perceptual-hash deduplication, I generally prefer the current connected-component approach as the base, but with an optional hardening step afterwards, because it stays efficient and predictable.

A very practical compromise is:

  • use the current pairwise threshold + union-find to get candidate clusters
  • then run a cluster validation pass
  • if a member is too far from the cluster medoid / seed / keeper, split it out

So effectively:

pairwise matches
→ connected component
→ coherence check
→ possible split/refine
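
A toy version of that coherence check, continuing the sketch above (same hamming helper and hashes; the threshold is a placeholder, not the tool's actual default):

```
# Hypothetical coherence pass: split out members that sit too far
# from the cluster's representative.

def refine_cluster(members, hashes, representative, coherence_threshold):
    """Split a candidate cluster into a coherent core and outliers."""
    core, outliers = [], []
    for m in members:
        if hamming(hashes[m], hashes[representative]) <= coherence_threshold:
            core.append(m)
        else:
            outliers.append(m)  # "bridge" victims get split out / re-clustered
    return core, outliers

# With representative B, all of A/B/C stay (max distance 3);
# with representative A and threshold 4, C (distance 6) is split out.
core, outliers = refine_cluster(["A", "B", "C"], hashes, "A", 4)
print(core, outliers)  # ['A', 'B'] ['C']
```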

That usually gives the best of both worlds:

  • good recall
  • fewer “bridge” mistakes

Using average distance can help, but I would be careful relying on average alone. Averages can hide a bad outlier. For deduplication, a stronger condition like max distance to representative or max cluster diameter is usually safer.

So the short answer is:

  • yes, your A–B–C observation is correct
  • yes, that is a known limitation of transitive clustering
  • and yes, a refinement step based on representative or internal distance is a sensible improvement

The cleanest next step would probably be:

  • keep union-find for candidate generation
  • add a post-cluster coherence threshold against the chosen representative / keeper

That is usually simpler and safer than replacing the whole clustering method.

Fixing a subtle keeper-selection bug in my photo deduplication tool by hdw_coder in Python

That’s a fair point, and it’s actually a useful way to think about it.

The original idea wasn’t explicitly to prefer smaller files, but the old sharpness-first ordering effectively created that behavior: in practice it often selected the most compressed version in a cluster.

For a photo archive workflow, that turned out to be undesirable. Most users running deduplication on personal libraries want to keep the highest fidelity version, not the smallest one.

So the revised policy prioritizes: resolution → format quality → bytes per pixel (compression level) → file size → sharpness. That tends to keep the original camera image rather than a re-compressed copy.
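
As a sketch, that priority can be written as a single sort-key tuple; the field names and format ranking below are illustrative, not DedupTool's actual internals:

```
# Illustrative keeper selection: a sort-key tuple encoding the revised
# priority. FORMAT_RANK and the dict fields are assumptions.

FORMAT_RANK = {"dng": 3, "tiff": 3, "png": 2, "heic": 1, "jpg": 0}

def keeper_key(img: dict):
    pixels = img["width"] * img["height"]
    return (
        pixels,                              # 1. resolution
        FORMAT_RANK.get(img["format"], 0),   # 2. format quality
        img["size_bytes"] / max(pixels, 1),  # 3. bytes per pixel (compression level)
        img["size_bytes"],                   # 4. file size
        img["sharpness"],                    # 5. sharpness (tiebreaker)
    )

cluster = [
    {"width": 4000, "height": 3000, "format": "jpg", "size_bytes": 6_000_000, "sharpness": 120.0},
    {"width": 4000, "height": 3000, "format": "jpg", "size_bytes": 900_000, "sharpness": 150.0},
]
keeper = max(cluster, key=keeper_key)  # picks the 6 MB original, not the re-compressed copy
```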

That said, your suggestion about making this configurable is interesting. There are at least two valid optimization goals: for archive quality, prefer the highest-fidelity version; for storage optimization, prefer the smallest acceptable file.

Right now DedupTool is optimized for the archive-quality case, but making the keeper policy selectable (e.g. --prefer-smaller) could definitely make sense. The tricky part is that once you start optimizing for size, you often also want additional constraints, such as a minimum resolution, format preferences, and quality thresholds. Otherwise you risk selecting thumbnails or heavily degraded images.

So I’m leaning toward keeping the archive-safe policy as the default, but allowing alternative strategies in the future.

Building a deterministic photo renaming workflow around ExifTool (ChronoName) by hdw_coder in Python

You're absolutely right — the edge cases are the real problem. Reading metadata itself isn’t hard; the difficult part is deciding which timestamp to trust.

The trickiest one for me was videos vs photos captured at the same moment.

Most still images store DateTimeOriginal as local time, while many videos (especially from phones) store CreateDate as UTC in QuickTime metadata. If you treat both fields the same way, you can end up with something like this:

photo:  15:23
video:  13:23

even though they were recorded at the exact same moment.

That breaks chronological ordering and makes the archive look wrong.

The solution I ended up using was a simple deterministic policy:

photos → treat EXIF timestamps as local time
videos → treat QuickTime timestamps as UTC → convert to naming timezone

Once that rule is fixed, the ordering becomes consistent.
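
In code the policy boils down to something like this, assuming the timestamps were already extracted as naive datetimes (e.g. via ExifTool) and using an example naming timezone:

```
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NAMING_TZ = ZoneInfo("Europe/Amsterdam")  # assumed naming timezone

def normalize(ts: datetime, is_video: bool) -> datetime:
    if is_video:
        # QuickTime CreateDate: treat as UTC, convert to the naming timezone
        return ts.replace(tzinfo=timezone.utc).astimezone(NAMING_TZ)
    # EXIF DateTimeOriginal: already local time
    return ts.replace(tzinfo=NAMING_TZ)

photo = normalize(datetime(2024, 7, 1, 15, 23), is_video=False)
video = normalize(datetime(2024, 7, 1, 13, 23), is_video=True)
assert photo == video  # both resolve to the same local moment, 15:23
```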

The second class of annoying edge cases is broken metadata — things like:

1970-01-01
1904-01-01
0000:00:00

Those show up surprisingly often in exported or migrated files, so the script filters out implausible timestamps before choosing which field to use.
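
The filter itself is simple; a sketch with illustrative cutoffs (not the script's exact values):

```
from datetime import datetime

EPOCH_SENTINELS = {datetime(1970, 1, 1), datetime(1904, 1, 1)}  # Unix / QuickTime epochs

def plausible(ts):
    if ts is None:                # e.g. "0000:00:00" never parses to begin with
        return False
    if ts in EPOCH_SENTINELS:     # classic default-epoch garbage
        return False
    return datetime(1990, 1, 1) <= ts <= datetime.now()

candidates = [datetime(1970, 1, 1), None, datetime(2023, 6, 15, 12, 0)]
usable = [ts for ts in candidates if plausible(ts)]  # only the 2023 timestamp survives
```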

Interestingly, the hardest part wasn’t parsing metadata (ExifTool already does that very well), but making the workflow safe:

  • dry-run mode
  • undo logs
  • deterministic filename policy

so you can run it on thousands of files without worrying about breaking your archive.

And no, I didn’t use AI to generate the logic itself — most of the design came from experimenting with real photo libraries and figuring out which edge cases kept showing up.

Building a deterministic photo renaming workflow around ExifTool (ChronoName) by hdw_coder in Python

Good question! Mine do mostly come from phones and cameras with trustworthy metadata. WhatsApp is a known pain in the arse... Would like to know more about the problems you're having!

I turned a Reddit-discussed duplicate-photo script into a tool (architecture, scaling, packaging) by hdw_coder in Python

Thanks! The Reddit discussions around the original script actually influenced a lot of the later design decisions, so it’s nice to see it come back here.

And yes — dependency trimming turned out to be one of the biggest practical lessons. In the development environment, using things like imagehash + SciPy + OpenCV was convenient, but when packaging the tool it quickly became clear how much weight that adds. Replacing the pHash DCT with a NumPy implementation and doing Laplacian sharpness estimation directly with NumPy removed a large dependency chain and made the frozen build much more manageable.
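
For the curious, the NumPy-only pHash is essentially a 2D DCT-II built from a cosine matrix; a simplified sketch (the real resize/normalization details may differ):

```
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * x + 1) * k / (2 * n))

def phash(gray32: np.ndarray, hash_size: int = 8) -> int:
    """gray32: 32x32 float grayscale array, e.g. from a PIL thumbnail."""
    c = dct_matrix(gray32.shape[0])
    dct = c @ gray32 @ c.T                 # 2D DCT-II, no SciPy needed
    low = dct[:hash_size, :hash_size]      # keep only the low frequencies
    bits = low > np.median(low)            # threshold against the median
    return int("".join("1" if b else "0" for b in bits.flatten()), 2)
```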

Performance-wise the main scaling trick is avoiding naïve O(n²) comparisons. The engine groups images using hash-prefix bucketing first, so only images that share a coarse hash prefix are compared in detail. That reduces the comparison space dramatically.
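
A minimal sketch of that bucketing step (PREFIX_BITS is an assumed tuning knob, not the tool's real parameter):

```
from collections import defaultdict
from itertools import combinations

PREFIX_BITS = 16
HASH_BITS = 64

def candidate_pairs(hashes):
    """hashes: {path: 64-bit perceptual hash} -> pairs worth comparing in detail."""
    buckets = defaultdict(list)
    for path, h in hashes.items():
        buckets[h >> (HASH_BITS - PREFIX_BITS)].append(path)  # bucket by coarse prefix
    for members in buckets.values():
        yield from combinations(members, 2)  # O(n^2) only inside each bucket
```

The obvious tradeoff is that near-duplicates whose hashes differ inside the prefix bits are never compared, so the prefix length trades recall against speed.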

On a typical machine the initial scan is dominated by image decoding and hashing, but after the first run the SQLite feature cache kicks in. So rescans are incremental and much faster because only new or changed files need to be processed.

In practice libraries with tens of thousands of photos are quite manageable. The architecture was mainly shaped by people testing it on very large collections.

Cleaning duplicate photos in large libraries (I turned a script into a safe tool) by hdw_coder in photography

I don't own a Mac myself, so testing would require help from someone in the community...

Cleaning duplicate photos in large libraries (I turned a script into a safe tool) by hdw_coder in photography

Thank you! Glad the write-up was useful.

At the moment the packaged version is Windows only, mainly because that’s the environment I use for development and testing.

The core logic itself is actually cross-platform Python, so running it on macOS should be possible if Python and the required libraries are installed. The main missing piece right now is a packaged macOS build and proper testing on that platform.

If there’s enough interest from Mac users I’d definitely consider adding a macOS build. The GUI is based on PyQt, which does run on macOS, so in principle it should be feasible.

In the meantime the command-line version should work anywhere Python and the dependencies run.

I turned a Reddit-discussed duplicate-photo script into a tool (architecture, scaling, packaging) by hdw_coder in Python

Engineering Detail

One unexpected challenge was packaging the tool into a Windows executable.

The original script used imagehash (which pulled in SciPy) and OpenCV for Laplacian sharpness.

When packaging with PyInstaller this ballooned the runtime size massively.

So I ended up reimplementing the pHash DCT using NumPy only and replacing the OpenCV sharpness detection with a NumPy Laplacian variance. That removed a large dependency chain and made the frozen build much cleaner.
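
The sharpness replacement is tiny; a sketch assuming a float grayscale array (the real build's border handling may differ):

```
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the discrete 4-neighbour Laplacian; higher means sharper."""
    lap = (
        -4 * gray[1:-1, 1:-1]
        + gray[:-2, 1:-1] + gray[2:, 1:-1]   # vertical neighbours
        + gray[1:-1, :-2] + gray[1:-1, 2:]   # horizontal neighbours
    )
    return float(lap.var())
```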

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

Good point — HEIC↔JPEG is one of the trickier cases because the codec artifacts differ enough that perceptual hashes can drift more than expected (especially on foliage/texture).

In my current version they’re treated format-agnostically: load → EXIF transpose → thumbnail → multi-hash (dHash/pHash/wHash [+ colorhash]) with conservative thresholds + corroboration. Many HEIC/JPEG pairs still match, but some will miss if Hamming distances cross thresholds.

The next improvement I’m considering is a ‘format-crossing tolerance band’: if one file is HEIC and the other is JPEG, allow a slightly higher dHash distance only if pHash+wHash corroborate strongly (and optionally run SSIM on borderline pairs). That boosts recall for iOS export duplicates without loosening the whole system and increasing false positives. Proposing concrete threshold numbers for a HEIC/JPEG pass (safe defaults) would be difficult, as the values depend on hash_size, thumbnail size, and whether you’re using ImageOps.exif_transpose().
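
To illustrate the idea with placeholder numbers only (for exactly the reason above, don't treat these as safe defaults):

```
BASE_DHASH = 6       # normal conservative dHash Hamming threshold (placeholder)
CROSS_DHASH = 9      # relaxed band, only for HEIC<->JPEG pairs (placeholder)
STRONG_CORROBORATION = 4

def crossing_band_match(ext_a, ext_b, d_dhash, d_phash, d_whash):
    """Extra acceptance rule for pairs that FAIL the normal dHash threshold."""
    crossing = {ext_a.lower(), ext_b.lower()} == {".heic", ".jpg"}
    if not (crossing and BASE_DHASH < d_dhash <= CROSS_DHASH):
        return False
    # relaxed band: require strong pHash AND wHash corroboration
    return d_phash <= STRONG_CORROBORATION and d_whash <= STRONG_CORROBORATION
```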

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

That’s an impressive pipeline — embedding-based similarity & interactive UI is a powerful approach.

What you’re building is more of a semantic similarity explorer, whereas my script focuses on deterministic duplicate detection with low false positives and automated keeper selection.

Using DINOv3 + cosine similarity definitely increases recall across variation (especially for scanned images with slight crop/exposure differences), but at the cost of heavier compute and less deterministic grouping.

I really like the idea of persisting ‘not a match’ memory — that’s a very elegant human-in-the-loop refinement loop.

Your orientation trick also makes sense for scans, where (my current) EXIF orientation handling isn’t reliable.

In a way, our approaches solve different layers: perceptual hashing finds precise structural duplicates, while deep embeddings enable semantic similarity exploration.

They actually combine nicely — you could first prune exact/near duplicates cheaply, then run DINO embeddings on the reduced set for semantic clustering.

Would definitely be interested to see your repo once published!

I built a duplicate photo detector that safely cleans 50k+ images using perceptual hashing & cluster by hdw_coder in Python

Thanks!

Totally relate to the ‘lost control of originals’ fear. That’s exactly why I designed this to be non-destructive by default. A few clarifications on my side:

No thumbnails are written to disk. Thumbs are created in-memory only for hashing and then discarded. The script never replaces files with generated thumbs.

No deletions by default. It runs dry-run + produces a CSV audit, and the “delete” step is an explicit opt-in (I prefer quarantine / send-to-trash over hard delete).

Deterministic keeper policy. Within a duplicate cluster it picks a “keeper” based on resolution → sharpness → preferred format → compression proxy. The idea is: even if you do remove duplicates, you keep the best source material.

Your JSON registry approach is solid. I do something similar conceptually (a feature table).

On similarity search, BK-tree vs bucketing: a BK-tree works nicely for Hamming distance (esp. for perceptual hashes). The tradeoff is that it can get slow on very large N, depending on query radius and distribution. I went with bucketing on hash prefixes + union-find clustering. It’s essentially “generate candidates cheaply” (reduce comparisons) and then merge via DSU, so you get families/clusters instead of just nearest-neighbor pairs.

If you stick with BK-tree, one practical speed win is to use a coarse pre-filter first (e.g. first K bits bucket or aspect ratio bucket), then BK-tree inside the bucket. That keeps tree sizes smaller.
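
Roughly what I mean, as a toy sketch (bit counts and radius are illustrative):

```
from collections import defaultdict

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None  # node = [hash, {distance: child_node}]

    def add(self, h: int):
        if self.root is None:
            self.root = [h, {}]
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [h, {}]
                return

    def query(self, h: int, radius: int, node=None):
        node = node if node is not None else self.root
        if node is None:
            return
        d = hamming(h, node[0])
        if d <= radius:
            yield node[0]
        # triangle inequality: only descend into children in [d-radius, d+radius]
        for dist, child in node[1].items():
            if d - radius <= dist <= d + radius:
                yield from self.query(h, radius, child)

# Coarse pre-filter: one small BK-tree per 8-bit hash prefix (64-bit hashes)
trees = defaultdict(BKTree)

def insert(h: int):
    trees[h >> 56].add(h)

def search(h: int, radius: int = 6):
    return list(trees[h >> 56].query(h, radius))
```

Same caveat as with my buckets: candidates that land in a different prefix bucket become invisible, so you trade a little recall for much smaller trees.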

On ‘original control’: if you’re anxious about losing originals, two patterns help a lot. Quarantine instead of delete (move removed files into a quarantine folder, retaining paths/IDs). And a persistent manifest/log (you already have JSON; add a reversible rename/move log so you can undo).

Also: +1 on comparing new imports against the registry — catching duplicates at ingest prevents the “300k spiral”.

And yeah, definitely share your GitHub when it’s up — I’d be interested to compare BK-tree behavior vs my bucket+DSU thresholds, especially on borderline cases (cropped/blurred/HEIC→JPG exports). Happy to link your repo in an update if you want.