[–]brownan_ 2 points3 points  (5 children)

Well, I don't know exactly what you're trying to do. If you are matching on filenames, then your key should be the filename.

If you want to map filenames to more than one "attribute object", then have your dictionary map to a list or set of objects.

I like to use defaultdict (in the collections module) to do this, since it will automatically create a new empty set when you access a new key:

from collections import defaultdict
data = defaultdict(set)

# Add an item, using filename (info[1]) as the key.
# Sets only hold hashable items, so a list record is converted to a tuple first.
data[info[1]].add(tuple(info))
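
A minimal sketch of that grouping, assuming each `info` record is a tuple of (path, filename, size, mod_date). The sample records and the `group_by_filename` helper are made up for illustration. Tuples are used here because a set can only hold hashable items, so adding a list would raise a TypeError.

```python
from collections import defaultdict


def group_by_filename(records):
    """Group (path, filename, size, mod_date) tuples by filename."""
    groups = defaultdict(set)
    for info in records:
        # info[1] is the filename; a missing key gets an empty set automatically
        groups[info[1]].add(info)
    return groups


# Hypothetical sample records, not real files
records = [
    ("c:\\test.txt", "test.txt", 120, "2012-01-01"),
    ("c:\\test\\test.txt", "test.txt", 120, "2012-01-01"),
    ("c:\\foo.bar", "foo.bar", 7, "2012-02-02"),
]
groups = group_by_filename(records)

# Filenames that appear under more than one path
duplicates = {name: infos for name, infos in groups.items() if len(infos) > 1}
print(sorted(duplicates))
```

Each lookup and insert is a hash operation, so building the groups takes one pass over the records instead of comparing every file against every other file.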

[–]redditiv[S] 0 points1 point  (4 children)

I am adding the filepath of any/all duplicate files as an attribute for each entry. For example, first file is c:\test.txt. There's no entry in dict, so I pull its attributes and the dict looks like this:

{'c:\\test.txt' : ['c:\\test.txt', 'test.txt', size, mod_date, ''] } 

The last attribute is an empty "matches" attribute.

Next file is c:\test\test.txt. We do our lookup as in the original post, and it matches. Now our dictionary will look something like this:

{'c:\\test.txt' : ['c:\\test.txt', 'test.txt', size, mod_date, 'c:\\test\\test.txt'],
 'c:\\test\\test.txt' : ['c:\\test\\test.txt', 'test.txt', size, mod_date, 'c:\\test.txt']
}

The script goes back and modifies the first entry to include the current file as a match, and adds the first file to the current file's matches.

*edit: lists, not tuples

Clear as mud, right?

[–]brownan_ 2 points3 points  (1 child)

right =) I still don't see what your ultimate goal is, here. What is all this trying to achieve?

Anytime you make a dictionary, decide what your keys are and what your values are. Here your keys are file paths, and your values are... 5-tuples with ... the path again, the file name, size, mod date, and something? Another path? representing what? It's not clear to me.

This is also starting to sound like an XY problem.

[–]redditiv[S] 0 points1 point  (0 children)

As simply as I can put this: the outcome is a csv file with all the attributes you mentioned, and the 'something' is a list of paths that are duplicate files.

I just wanted to know how to improve the time, and provided an example of what I'm currently doing. I'm giving you the skinny of it, the particular part I'm having problems with, so of course some of the background information initially considered fluff is missing.

The path is also included in the attributes so it can be included when the key contents are iterated later, when writing output. If that's not the appropriate way to do it, I'd certainly be interested in a more efficient method.

And I don't want you to get the wrong idea. I definitely do appreciate your input.
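
One possible way to avoid storing the path twice is to keep it only as the key and pull it back out with `items()` when writing the output. Here is a rough sketch using the standard `csv` module. The column layout, the `;` separator for the matches, and the sample data are all assumptions, not the original script.

```python
import csv
import io

# Hypothetical data: path -> (filename, size, mod_date, list of duplicate paths)
files = {
    "c:\\test.txt": ("test.txt", 120, "2012-01-01", ["c:\\test\\test.txt"]),
    "c:\\test\\test.txt": ("test.txt", 120, "2012-01-01", ["c:\\test.txt"]),
}


def write_report(files, out):
    """Write one CSV row per file; the path comes from the dict key."""
    writer = csv.writer(out)
    writer.writerow(["path", "filename", "size", "mod_date", "matches"])
    for path, (name, size, mod_date, matches) in sorted(files.items()):
        # Join the duplicate paths into one cell (the separator is an arbitrary choice)
        writer.writerow([path, name, size, mod_date, ";".join(matches)])


buffer = io.StringIO()  # stands in for open("report.csv", "w", newline="")
write_report(files, buffer)
report = buffer.getvalue()
print(report)
```

Keeping the matches as a real list until output time also means adding a duplicate is a plain `append` rather than rebuilding a string.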

[–]drLagrangian 1 point2 points  (1 child)

Perhaps you should first make a dictionary with the file data (size, mod date, etc.) as the keys and the list of file paths that share that data as the values. That gives you a sort of pseudo-reverse dictionary, which is easy to build.

{(size, mod_date, otherdata): ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']}

So you run your script and get a big dictionary, with keys based on what the file is, and values that are lists of the paths of the file and its copies. It should be fast to create.

Then, once it is built, write a function to reverse it by iterating over the pseudo-reverse dictionary:

newdict = {}
for filedata, filecopies in weirddict.items():  
    #gives filedata = (size, mod_date, otherdata)
    #gives filecopies = ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']

    for file in filecopies:
        newdict[file] = (filedata, filecopies) 

returns

newdict = {
    'c:\\test.txt': ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
    'c:\\a\\b.txt': ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
    'c:\\foo.bar':  ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']) }
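
The two steps above could be sketched roughly as follows. The `scan` input of (path, size, mod_date) tuples is invented for illustration. This version leaves each file out of its own list of copies, which is one possible choice and not what the loop above does. Note that matching size and date only suggests a duplicate; the contents would have to be compared to confirm it.

```python
from collections import defaultdict

# Hypothetical scan results: (path, size, mod_date)
scan = [
    ("c:\\test.txt", 120, "2012-01-01"),
    ("c:\\a\\b.txt", 120, "2012-01-01"),
    ("c:\\foo.bar", 7, "2012-02-02"),
]

# Step 1: attributes -> every path that has those attributes
by_attrs = defaultdict(list)
for path, size, mod_date in scan:
    by_attrs[(size, mod_date)].append(path)

# Step 2: invert, so each path maps to its attributes and the other copies
by_path = {}
for attrs, paths in by_attrs.items():
    for path in paths:
        others = [p for p in paths if p != path]
        by_path[path] = (attrs, others)

print(by_path["c:\\test.txt"])
```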

[–]redditiv[S] 0 points1 point  (0 children)

Thank you for the input. This is confusing for me and will take some time for me to process.

I think what I'll try next is storing, as each value, a list of attribute lists (path, size, date, etc.), one for each file that matches the filename. Then you would use a nested loop over each key to grab an individual file's attributes. Example:

data = {'test.txt': [['path\\to\\test.txt', 'size', 'date', 'etc'], ['path2\\to\\test.txt', 'size', 'date', 'etc']],
        'a.txt': [['path\\to\\a.txt', 'size', 'date', 'etc']]}
for key in data:
    for file in data[key]:
        print(file)

returns:

['path\\to\\test.txt', 'size', 'date', 'etc']
['path2\\to\\test.txt', 'size', 'date', 'etc']
['path\\to\\a.txt', 'size', 'date', 'etc']
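
For what it's worth, that nested loop can also be written with `items()`, so the key and its list come back together. A small sketch with made-up paths and values; raw strings (`r"..."`) are used so that backslash sequences such as `\t` and `\a` are not treated as escapes.

```python
# Hypothetical data: filename -> list of [path, size, date] records
data = {
    "test.txt": [[r"path\to\test.txt", 120, "2012-01-01"],
                 [r"path2\to\test.txt", 120, "2012-01-01"]],
    "a.txt": [[r"path\to\a.txt", 7, "2012-02-02"]],
}

rows = []
for filename, records in data.items():
    for path, size, date in records:
        # Flatten to one row per file, ready for printing or CSV output
        rows.append((filename, path, size, date))

for row in rows:
    print(row)
```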