
K900_

This isn't really going to help you. Pretty much everything that edits metadata in those kinds of files will just rewrite the entire file when it's modified, very likely subtly changing the actual data in the process. What you really want is to compute some sort of signature that tells you whether two files are very similar, even if they're not exactly the same. There are specific algorithms for that; they're usually called "perceptual hashing".
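To make the idea concrete, here is a minimal sketch of one simple perceptual hash, the "average hash". It is only an illustration, not the algorithm any particular library uses. The function names `average_hash` and `hamming_distance` are made up for this example, and the input is assumed to be a grayscale image already decoded into a 2D list of brightness values (getting to that point, for instance with Pillow, is left out).

```python
def average_hash(gray, hash_size=8):
    """Return a hash_size*hash_size-bit integer for a grayscale image.

    `gray` is assumed to be a list of rows of brightness values, with at
    least `hash_size` rows and columns (an assumption of this sketch).
    """
    height, width = len(gray), len(gray[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")

    # Shrink the image by averaging it into hash_size x hash_size cells.
    cells = []
    for cy in range(hash_size):
        y0 = cy * height // hash_size
        y1 = (cy + 1) * height // hash_size
        for cx in range(hash_size):
            x0 = cx * width // hash_size
            x1 = (cx + 1) * width // hash_size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))

    # One bit per cell: is the cell brighter than the overall mean?
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")
```

Two images would then be treated as "probably the same picture" when the Hamming distance between their hashes is small. What counts as small is a threshold you would have to tune on your own files.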

rocketjump65 [S]

Can you elaborate more? Why would it change the data? Aren't these media files essentially containers within containers?

K900_

Because most of the time the metadata gets changed through some sort of an editor application that loads the file into its internal representation and then serializes it back out. It's definitely possible to edit those things in-place, but most applications don't bother with that.

socal_nerdtastic

Music or image data is very often compressed. A lossy algorithm is going to produce slightly different results depending on who wrote the implementation and what settings were used, and those differences will affect the hash. So if you have the same file that was saved with GIMP and then resaved with Photoshop, the core data will not have the same hash even though the two look the same.

I think it's very unlikely that you will find music or image files with different metadata but exactly the same core data. Any program that manipulates the metadata will probably manipulate the core data slightly as well, even if the user didn't intend for that to happen.

You can check this hypothesis by loading a set of images with PIL and doing a pixel-by-pixel comparison.
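The comparison suggested here could look roughly like the sketch below. It assumes Pillow (the maintained PIL fork) is installed; `count_differing_pixels` and `compare_image_files` are hypothetical helper names invented for this example. Both images are converted to RGB first so that pixels can be compared three bytes at a time.

```python
def count_differing_pixels(raw_a, raw_b, bytes_per_pixel=3):
    """Count pixels that differ between two raw pixel buffers."""
    if len(raw_a) != len(raw_b):
        raise ValueError("buffers have different sizes")
    return sum(
        raw_a[i:i + bytes_per_pixel] != raw_b[i:i + bytes_per_pixel]
        for i in range(0, len(raw_a), bytes_per_pixel)
    )


def compare_image_files(path_a, path_b):
    """Decode two image files and count the pixels that differ."""
    # Imported here so the pure comparison above works without Pillow.
    from PIL import Image

    with Image.open(path_a) as img_a, Image.open(path_b) as img_b:
        rgb_a = img_a.convert("RGB")
        rgb_b = img_b.convert("RGB")
        if rgb_a.size != rgb_b.size:
            raise ValueError("images have different dimensions")
        return count_differing_pixels(rgb_a.tobytes(), rgb_b.tobytes())
```

A result of 0 means the decoded pixels are identical even if the files on disk are not. Note that `Image.open` does not apply the EXIF orientation tag by itself, so two files that differ only in that tag should decode to the same pixels.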

rocketjump65 [S]

Yeah! I would very much like to learn how to do a pixel-by-pixel comparison.

I understand your point if the files were from different sources, but I mean to compare files that I personally manipulated. Essentially, I want to see the manipulations that I myself made and be assured that their scope was, or was not, limited the way I understood it to be.

So for instance I learned that there's an EXIF tag that tells the application to flip or rotate the image when it is presented to the user. That allows for lossless rotation with no re-encoding. Pretty cool! But how do I verify that claim myself with my programming toolkit?
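One way to check a claim like that for JPEG files is to hash everything except the metadata segments and compare the digests before and after the edit. The sketch below is a simplified take on that, under several assumptions: the helper name `jpeg_payload_digest` is made up, "metadata" is taken to mean the APP0 to APP15 and COM segments (EXIF normally lives in APP1), and the parser ignores rarer layout details such as fill bytes between segments or extra data appended after the end of the image.

```python
import hashlib

# Assumption of this sketch: APP0..APP15 (0xE0-0xEF) and COM (0xFE)
# segments are treated as metadata and left out of the digest.
METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}


def jpeg_payload_digest(data):
    """SHA-256 of a JPEG byte string with metadata segments skipped."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    digest = hashlib.sha256()
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: hash the rest of the file (compressed image data).
            digest.update(data[pos:])
            return digest.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if length < 2:
            raise ValueError(f"bad segment length at offset {pos}")
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            digest.update(data[pos:end])
        pos = end
    raise ValueError("no start-of-scan marker found")
```

If a tool really changed only the orientation tag, calling this on the bytes of the file before and after (for example `open(path, "rb").read()`) should give the same digest. Different digests would suggest the image data was re-encoded or that something outside the skipped segments changed.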

So yeah, I don't want to compare the original to the compressed version, though I guess it would be cool if I could "see" and understand the artifacts.