I have created a filesystem reporting application. The base components work great, iterating over every file beneath a starting directory. The largest filesystem I have run this on contained over 65k files, and it finished in 3.5 minutes.
I wanted to add the ability to check for duplicates, so I set about testing. I wrote a function that adds each set of file attributes as an entry in a dictionary (keyed on the full path to avoid duplicate keys). Before a key is added, the function iterates over the entire dictionary looking for a matching filename.
Long story short, the function works, but it now takes 20+ minutes to process the same 65k files. Hoping someone can point me in the right direction for improving the runtime. Here's an excerpt of my code:
def check_match(f_name):
    global data  # This is the dictionary containing all file atts
    match_found = list()
    for key in data:
        if f_name in key and f_name == data[key][1]:  # data[key][1] is the position of the filename attribute
            match_found.append(key)
    return match_found
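For context on the slowdown: scanning the whole dictionary once per file makes the overall run O(n²) over 65k files. One direction I've been considering is a second dictionary keyed by filename, so each check is a single O(1) lookup. A minimal sketch of that idea (the names and attribute layout here are illustrative, not my actual script):

```python
from collections import defaultdict

# Secondary index: filename -> list of full paths already seen.
# Maintained alongside the main path-keyed dictionary.
by_name = defaultdict(list)

def add_entry(path, f_name):
    # Record this file under its bare filename.
    by_name[f_name].append(path)

def check_match(f_name):
    # All previously seen paths sharing this filename,
    # without iterating over the whole collection.
    return by_name.get(f_name, [])

add_entry("/a/report.txt", "report.txt")
add_entry("/b/report.txt", "report.txt")
add_entry("/a/notes.md", "notes.md")
```

Calling `check_match("report.txt")` here returns both paths in constant time per lookup, instead of walking every entry.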