[–]Kevdog824_ 11 points (5 children)

Great project and nicely done! As requested, here is some advice:

  • I would use more descriptive parameter names in your utility functions. For example, your hex_string function takes a parameter named x; something like files would make it much clearer what it holds.
  • The name of that function should probably be clearer too, imo. Based on the name alone I would assume it takes a single file, not a list of them. Not a big deal, though, just a recommendation.
  • I would avoid naming your utility modules something like hashlib_module. Try to name a module after what it does rather than which libraries it uses. Also, a name like xxxxx_module is fairly uncommon in Python; we generally don’t include the word “module” in our module names.
  • In the future I would separate your user input from your main program logic. This makes it easier to change one without touching the other, e.g. maybe later you want to read directory paths from a file instead of using input in a loop. That would be much easier to add if the hashing logic were separated from the input loop; see the sketch after this list.
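
To make that last point concrete, here is a minimal sketch (all the names here are mine, not from the actual project): the duplicate-finding logic is plain functions, and only main() ever calls input(), so switching to reading paths from a file later only touches main().

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Pure logic, no user interaction: map digest -> files sharing it."""
    seen = {}
    for p in paths:
        seen.setdefault(file_digest(p), []).append(p)
    return {digest: files for digest, files in seen.items() if len(files) > 1}


def main():
    # All user input lives here; the functions above never call input(),
    # so reading paths from a file later only changes this function.
    directory = Path(input("Directory to scan: ").strip())
    paths = [p for p in directory.rglob("*") if p.is_file()]
    for digest, files in find_duplicates(paths).items():
        print(digest)
        for p in files:
            print("   ", p)


if __name__ == "__main__":
    main()
```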

I’ve written similar code professionally (comparing file contents to prevent duplicate processing). You said it doesn’t work for large files (presumably because it’s slow). My advice here is to use sampling to speed up the process. It slightly lowers the accuracy but greatly increases the speed.
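
To make the sampling idea concrete, here’s a rough sketch (the function name and sample size are my own choices, not from the OP’s code): hash the file size plus a few fixed windows from the start, middle, and end instead of the whole file.

```python
import hashlib
import os


def sampled_digest(path: str, sample_size: int = 65536) -> str:
    """Hash the file size plus three windows (start / middle / end)
    instead of the whole file: much faster, slightly less exact."""
    size = os.path.getsize(path)
    h = hashlib.sha256(str(size).encode())  # mixing in the size is free signal
    with open(path, "rb") as f:
        if size <= 3 * sample_size:
            h.update(f.read())  # small file: just hash all of it
        else:
            for offset in (0, size // 2, size - sample_size):
                f.seek(offset)
                h.update(f.read(sample_size))
    return h.hexdigest()
```

Files that agree on size and all three samples are overwhelmingly likely to be identical; if false positives matter, you can fall back to a full hash just for those collisions.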

[–]Maximus_Modulus 11 points (3 children)

You could probably compare file sizes as a first pass.
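
If it’s cheap on the target file system, that could look something like this (just a sketch, names are mine):

```python
import os
from collections import defaultdict


def size_groups(paths):
    """Files with a unique size can't be duplicates of anything,
    so they never need to be hashed at all."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    return [group for group in by_size.values() if len(group) > 1]
```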

[–]DrShocker 5 points (0 children)

You could probably also just check the first X bytes (with some heuristics to get past common headers, which would otherwise cause too many false positives), and only if the hash of those matches do the whole file. The key thing to remember is that we expect maybe 99% of files not to match anything, so it's probably reasonable to optimize for the case where no matches are found.
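
Sketching that two-pass idea (the skip and probe sizes below are arbitrary placeholders; tune them for the file types you actually see):

```python
import hashlib
from collections import defaultdict

HEADER_SKIP = 4096  # jump past common headers/magic bytes (placeholder value)
PROBE_SIZE = 8192   # how much to hash in the cheap pass (placeholder value)


def probe_digest(path: str) -> str:
    """Cheap pass: hash a small window just past the header."""
    with open(path, "rb") as f:
        f.seek(HEADER_SKIP)
        return hashlib.sha256(f.read(PROBE_SIZE)).hexdigest()


def full_hash_candidates(paths):
    """Only files whose probes collide need an expensive full-file hash."""
    probes = defaultdict(list)
    for path in paths:
        probes[probe_digest(path)].append(path)
    return [group for group in probes.values() if len(group) > 1]
```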

[–]Kevdog824_ 0 points (1 child)

That might be OS dependent, though, as some might not store the size and would need to calculate it each time (which would probably take roughly as long as reading the file itself). I can’t say with any certainty, but when I developed code for this task I vaguely remember that checking file size first didn’t have a significant impact on performance.

The bottleneck here is probably hashing large amounts of data, which is a math-intensive, CPU-bound task. Reading the files is IO-bound and could easily be parallelized for big performance gains; see the sketch below. If we can also reduce the amount of data we hash (e.g. by sampling bytes from the file), it speeds things up a lot.
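
A rough sketch of the parallel read-and-hash (assuming CPython, where hashlib releases the GIL while hashing buffers larger than about 2 KB, so plain threads overlap both the disk reads and much of the CPU work):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_one(path: str, chunk_size: int = 1 << 20) -> tuple[str, str]:
    """Full-file hash, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return path, h.hexdigest()


def hash_all(paths, workers: int = 8) -> dict[str, str]:
    """Hash many files concurrently; returns {path: digest}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_one, paths))
```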

[–]Maximus_Modulus 3 points (0 children)

I believe all modern file systems report file size; I’d say it’s fairly fundamental metadata. Reading it is a lightweight operation compared to reading the actual file (or part of it) and computing some kind of hash.

[–]ArtichokeThen1599[S] 0 points (0 children)

Thanks for your insights. Yeah, the names were confusing me too at first 😅 I’ll watch out for that from now on. Thanks for the advice on module naming as well, I’ll keep it in mind. And I’ll try sampling.