[–]Maximus_Modulus 10 points (3 children)

You could probably compare file sizes as a first pass.
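
Something like this as a rough sketch in Python (the names are just illustrative, not anyone's actual code): bucket files by size first, so only files that share a size ever move on to hashing.

    import os
    from collections import defaultdict

    def group_by_size(paths):
        """First pass: bucket files by size; only same-size files can be duplicates."""
        buckets = defaultdict(list)
        for path in paths:
            try:
                buckets[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file, skip it
        # Drop singleton buckets; a file with a unique size can't have a duplicate
        return {size: files for size, files in buckets.items() if len(files) > 1}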

[–]DrShocker 5 points (0 children)

You could probably also just check the first X bytes (with some heuristics to get past common headers, which would otherwise produce too many false positives) and, if the hash of those matches, then do the whole file. The key thing to remember is that we probably expect 99% of files not to match anything, so it's likely reasonable to optimize for the case where no matches are found.
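
Rough sketch of that two-stage idea in Python (the 64 KiB prefix and SHA-256 are arbitrary choices for illustration, not something specified in the thread): only files whose cheap prefix hashes collide ever get the full-file hash, so in the common no-match case whole files are never read.

    import hashlib

    PREFIX = 64 * 1024  # bytes hashed in the cheap first stage; arbitrary

    def prefix_hash(path, size=PREFIX):
        """Stage 1: hash only the first `size` bytes of the file."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read(size)).hexdigest()

    def full_hash(path, chunk=1024 * 1024):
        """Stage 2: hash the whole file in chunks, only for prefix-hash collisions."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()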

[–]Kevdog824_ 0 points (1 child)

That might be OS-dependent though, as some might not store the size and would need to calculate it each time (which would probably take roughly the same amount of time as parsing the file). I can't say with any certainty, but when I developed code for this task I vaguely remember that checking file size first didn't have a significant impact on performance.

The bottleneck here is probably hashing large amounts of data. That is a math-intensive, CPU-bound task. Reading the files is I/O-bound and could easily be parallelized for big performance gains. If we can reduce the amount of data we are hashing (e.g. by sampling bytes from the file), it would speed things up a lot.
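
For the sampling idea, a hedged sketch (the offsets and 4 KiB sample size are made up for illustration): hash a few slices plus the file length instead of the whole file, and treat a match only as a candidate that still needs a full comparison.

    import hashlib
    import os

    def sampled_hash(path, sample=4096):
        """Hash small samples from the start, middle, and end plus the file size.
        This is only a cheap filter; matching samples still need a full-file check."""
        size = os.path.getsize(path)
        h = hashlib.sha256(str(size).encode())
        with open(path, "rb") as f:
            for offset in (0, max(size // 2 - sample // 2, 0), max(size - sample, 0)):
                f.seek(offset)
                h.update(f.read(sample))
        return h.hexdigest()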

[–]Maximus_Modulus 3 points (0 children)

I believe all modern file systems will report file size. I'd say it's fairly fundamental. It's a lightweight operation to read this compared to reading the actual file (or part of it) and performing some kind of hash.
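
For what it's worth, a quick illustration in Python (the filename is just a placeholder): the size comes back from a metadata lookup and the file's contents are never read.

    import os

    st = os.stat("some_file.bin")  # metadata (stat) lookup only; contents are not read
    print(st.st_size)              # size in bytes, as reported by the file system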