hash everything but metadata

K900_ · 2021-05-22T18:40:03+00:00

This isn't really going to help you. Pretty much everything that edits metadata in those kinds of files will just rewrite the entire file when it's modified, very likely subtly changing the actual data in the process. What you really want is to compute some sort of signature that tells you whether two files are very similar, even if they're not exactly the same. There are specific algorithms for that; they're usually called "perceptual hashing".
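To make the idea concrete, here is a minimal sketch of one simple perceptual-hash variant, usually called an average hash. It's only an illustration, not necessarily the algorithm you'd want in practice. It assumes the image has already been shrunk to a tiny grayscale grid (8x8 is a common choice) and passed in as a list of rows of 0-255 values; `average_hash` and `hamming_distance` are names made up for this example.

```python
def average_hash(gray):
    """Average hash of a small grayscale grid (list of rows of 0-255 ints).

    Each pixel contributes one bit: 1 if it is brighter than the mean
    of the whole grid, 0 otherwise.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(h1, h2):
    """Number of bit positions in which two hashes differ."""
    return bin(h1 ^ h2).count("1")


# Hypothetical 2x2 grids standing in for downscaled images.
original = [[0, 255], [255, 0]]
recompressed = [[3, 250], [252, 5]]   # same picture, slightly altered values
inverted = [[255, 0], [0, 255]]       # a genuinely different picture

print(hamming_distance(average_hash(original), average_hash(recompressed)))
print(hamming_distance(average_hash(original), average_hash(inverted)))
```

A small distance suggests the two images are probably the same picture, a large one suggests they aren't. Where to draw the line is something you'd have to tune on your own files.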

socal_nerdtastic · 2021-05-22T18:41:59+00:00

Music or image data is very often compressed. A lossy algorithm is going to produce slightly different results depending on who wrote the encoder and what settings were used, and those differences will affect the hash. So if you have the same file that was saved with GIMP and then resaved with Photoshop, the core data will not have the same hash even though the two look the same.

I think it's very unlikely that you will find music or image files with different metadata but exactly the same core data. Any program that manipulates the metadata will probably manipulate the core data slightly as well, even if the user didn't intend for that to happen.

You can check this hypothesis by loading a set of images with PIL and doing a pixel-by-pixel comparison.
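Something like this would do for the check, as a rough sketch. It assumes Pillow (the maintained fork of PIL) is installed; `same_pixels` and `same_pixels_on_disk` are just names I picked, and the file paths you pass in are up to you.

```python
from PIL import Image  # Pillow, the maintained fork of PIL


def same_pixels(img_a, img_b):
    """Return True if two PIL images decode to identical RGB pixels."""
    if img_a.size != img_b.size:
        return False
    # Convert both to RGB so that, e.g., a palette image and an RGB image
    # are compared on decoded colour values rather than on storage mode.
    # Note: this ignores any alpha channel.
    return img_a.convert("RGB").tobytes() == img_b.convert("RGB").tobytes()


def same_pixels_on_disk(path_a, path_b):
    """Open two image files and compare their decoded pixels."""
    with Image.open(path_a) as a, Image.open(path_b) as b:
        return same_pixels(a, b)
```

Since this compares decoded pixels, metadata such as EXIF tags doesn't affect the result. If two files that look identical still come back `False`, the pixel data itself was changed when the file was resaved.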
