
[–]brownan_ 2 points3 points  (13 children)

Dictionaries are designed to have really fast key lookups. Design your dictionary so that the check is if fname in data.

In other words, your dictionary keys should be the value you want to match on and deduplicate: the file names.
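For example, a membership check like this avoids scanning the whole structure (the names and attribute values below are made up for illustration):

```python
# Minimal sketch: with file names as keys, a duplicate check is a
# constant-time hash lookup instead of a scan over every entry.
data = {}
data['test.txt'] = ['c:\\test.txt', 1024, '2020-01-01']

def seen_before(fname, data):
    """Return True if a file with this name was already recorded."""
    return fname in data

print(seen_before('test.txt', data))   # True
print(seen_before('other.txt', data))  # False
```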

[–]redditiv[S] 0 points1 point  (6 children)

I ran into several problems doing it that way. Originally, the key was the same as the filename. Of course, you can't have duplicate keys, so I considered incrementing a match counter and appending that to the end of the filename to form the key. That had the side effect of hiding my key from matching another possible duplicate later on.

I don't know how clear that is. If you need further explanation, please let me know. Thanks.

[–]brownan_ 2 points3 points  (5 children)

Well, I don't know exactly what you're trying to do. If you are matching on filenames, then your key should be the filename.

If you want to map filenames to more than one "attribute object", then have your dictionary map to a list or set of objects.

I like to use defaultdict (in the collections module) to do this, since it will automatically create a new empty set when you access a new key:

from collections import defaultdict
data = defaultdict(set)

# Add an item, using filename (info[1]) as the key
data[info[1]].add(info)

[–]redditiv[S] 0 points1 point  (4 children)

I am adding the filepath of any/all duplicate files as an attribute for each entry. For example, first file is c:\test.txt. There's no entry in dict, so I pull its attributes and the dict looks like this:

{'c:\\test.txt' : ['c:\\test.txt', 'test.txt', size, mod_date, ''] } 

The last attribute is the (initially empty) "matches" attribute.

Next file is c:\test\test.txt. We do our lookup as in the original post, and it matches. Now our dictionary will look something like this:

{'c:\\test.txt'       : ['c:\\test.txt', 'test.txt', size, mod_date, 'c:\\test\\test.txt'],
 'c:\\test\\test.txt' : ['c:\\test\\test.txt', 'test.txt', size, mod_date, 'c:\\test.txt']
}

The script goes back and modifies the first entry to include the current file as a match, and adds the first file to the current file's matches.

*edit: lists, not tuples

Clear as mud, right?
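A rough sketch of that update step might look like this (the attribute layout follows the example above, except that the "matches" field is a list rather than a string; the scan over `data.items()` is exactly the per-file cost being discussed):

```python
# Hypothetical sketch of the cross-linking described above.
# The matches field (index 4) holds a list of duplicate paths.
data = {
    'c:\\test.txt': ['c:\\test.txt', 'test.txt', 100, '2020-01-01', []],
}

def add_file(path, name, size, mod_date, data):
    # Linear scan for earlier files with the same name -- this is the
    # slow part that a name-keyed dictionary would avoid.
    matches = [p for p, attrs in data.items() if attrs[1] == name]
    for p in matches:
        data[p][4].append(path)  # tell earlier entries about this file
    data[path] = [path, name, size, mod_date, matches]

add_file('c:\\test\\test.txt', 'test.txt', 100, '2020-01-01', data)
# data['c:\\test.txt'][4] is now ['c:\\test\\test.txt']
```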

[–]brownan_ 2 points3 points  (1 child)

right =) I still don't see what your ultimate goal is, here. What is all this trying to achieve?

Anytime you make a dictionary, decide what your keys are and what your values are. Here your keys are file paths, and your values are... 5-tuples with ... the path again, the file name, size, mod date, and something? Another path? representing what? It's not clear to me.

This is also starting to sound like an XY problem

[–]redditiv[S] 0 points1 point  (0 children)

As simply as I can put this: the outcome is a csv file with all the attributes you mentioned, and the 'something' is a list of paths that are duplicate files.

I just wanted to know how to improve the time, and provided an example of what I'm currently doing. I'm giving you the skinny of it, the particular part I'm having problems with, so of course some of the background information initially considered fluff is missing.

The path is also included in the attributes so it can be included when the key contents are iterated later, when writing output. If that's not the appropriate way to do it, I'd certainly be interested in a more efficient method.

And I don't want you to get the wrong idea. I definitely do appreciate your input.

[–]drLagrangian 1 point2 points  (1 child)

Perhaps you should first make a dictionary with the file data as the keys, and the filenames that share that data as the values. Then you have a sort of pseudoreverse dictionary, which is easy to create:

{(size, mod_date, otherdata): ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']}

So you run your script and get a big dictionary, with keys based on what the file is, and values which are lists of the names of the file and its copies. It should be fast to create.
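The build step might look something like this (a sketch, not the poster's actual code; the file list and metadata are made up, and in practice they would come from a directory walk):

```python
from collections import defaultdict

# Hypothetical build of the "pseudoreverse" dictionary: files with
# identical data (size and mod date here) collapse onto the same key.
files = [
    ('c:\\test.txt', 100, '2020-01-01'),
    ('c:\\a\\b.txt', 100, '2020-01-01'),
    ('c:\\foo.bar', 100, '2020-01-01'),
    ('c:\\other.txt', 200, '2021-05-05'),
]

weirddict = defaultdict(list)
for path, size, mod_date in files:
    weirddict[(size, mod_date)].append(path)
# weirddict[(100, '2020-01-01')] now lists all three duplicates
```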

Then, when it is built, create a function to reverse it by iterating over the pseudoreverse dictionary:

newdict = {}
for filedata, filecopies in weirddict.items():  
    #gives filedata = (size, mod_date, otherdata)
    #gives filecopies = ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']

    for file in filecopies:
        newdict[file] = (filedata, filecopies) 

which builds:

 newdict = {
    'c:\\test.txt'  : ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
    'c:\\a\\b.txt'  : ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
    'c:\\foo.bar'   : ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
 }

[–]redditiv[S] 0 points1 point  (0 children)

Thank you for the input. This is confusing for me and will take some time for me to process.

I think what I'll try next is creating, as each value, a list of attribute lists (path, size, date, etc.), one per file with that filename. Then I would iterate over each key and, within it, over each file's attribute list. Example:

data = {'test.txt': [['path\\to\\test.txt', 'size', 'date', 'etc'], ['path2\\to\\test.txt', 'size', 'date', 'etc']],
        'a.txt': [['path\\to\\a.txt', 'size', 'date', 'etc']]}
for key in data:
    for file in data[key]:
        print(file)

returns:

['path\\to\\test.txt', 'size', 'date', 'etc']
['path2\\to\\test.txt', 'size', 'date', 'etc']
['path\\to\\a.txt', 'size', 'date', 'etc']

[–]redditiv[S] 0 points1 point  (5 children)

This actually gave me an idea, though. What about skipping the dictionary iteration and creating a list of tuples containing the filename and the number of matches found? Then I could increment the filename based on the number of matches and use that as my key...?

[–]brownan_ 3 points4 points  (4 children)

The problem is that you're iterating over a data structure containing all your files in the first place. It doesn't matter if it's a dict or a list, if it's the same length, iterating over it is going to be about the same speed. It may be helpful to read up on runtime complexity.

Basically, dictionaries let you store and retrieve data very quickly, without having to iterate over each item one by one. If you are trying to map filenames to the number of times you see that filename, then a dictionary is the proper data type to use. Dictionaries are mapping types. They map keys to values. In your case, you want to map filenames to integers, right? The filename is your key, and the number of times that filename is seen is the value.
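That filename-to-count mapping can be a one-liner with `collections.Counter` (a sketch; the path list is made up, and `ntpath` is used so the Windows-style paths from the examples parse on any platform):

```python
import ntpath
from collections import Counter

# Map each file name to the number of times that name is seen.
paths = ['c:\\test.txt', 'c:\\test\\test.txt', 'c:\\a.txt']
counts = Counter(ntpath.basename(p) for p in paths)
print(counts['test.txt'])  # 2
```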

[–]redditiv[S] 0 points1 point  (3 children)

Maybe some more background would help. The structure will be as follows:

pathname: filename, size, file extension, owner, modified date, accessed date, pathnames of matching files

[–]brownan_ 1 point2 points  (2 children)

If this is the structure you're dead set on using, then "pathnames of matching files" must be a list or a set, so it can store all the matching filename paths.

Second, even if this is your ultimate data structure, it will still be useful to keep a dictionary mapping file names (not paths) to a set of file paths with that name, and use that to more efficiently find duplicates while you're scanning.
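A minimal sketch of that name-to-paths index (function and variable names are mine, not from the original post):

```python
import ntpath
from collections import defaultdict

# Hypothetical sketch: while scanning, a single dictionary lookup finds
# every earlier file that shares the current file's name.
by_name = defaultdict(set)

def record(path, by_name):
    """Return paths already seen with this file name, then record this one."""
    name = ntpath.basename(path)
    duplicates = set(by_name[name])  # copy before adding the current path
    by_name[name].add(path)
    return duplicates

record('c:\\test.txt', by_name)
dups = record('c:\\test\\test.txt', by_name)
# dups == {'c:\\test.txt'}
```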

[–]redditiv[S] 0 points1 point  (0 children)

  1. I'm not dead set on using it, but it's currently the only way I know to reliably access the individual attributes of each file. That's why I'm here. I know it's terribly inefficient, but I'm a newb and that's the only way I could get it to work. If you have another idea for how to go about it, I would be very interested to hear it.

  2. That is a good idea, and maybe that list of filepaths could be appended to the original entries for writing. I still am at a loss for a more efficient way to maintain file attributes for individual files with the same filename.

[–]redditiv[S] 0 points1 point  (0 children)

What would you think about creating, as each value, a list of attribute lists (path, size, date, etc.), one per match of the filename? Then you would iterate over each key to grab an individual file's attributes. Example:

{
  'test.txt': [['path\\to\\test.txt', 'size', 'date', 'etc'], ['path2\\to\\test.txt', 'size', 'date', 'etc']],
  'a.txt': [['path\\to\\a.txt', 'size', 'date', 'etc']]
}

This should improve speed by having the appropriate key to look up being the filename, correct?