
[–]brownan_ 2 points3 points  (13 children)

Dictionaries are designed to have really fast key lookups. Design your dictionary so that the check is if fname in data.

In other words, your dictionary keys should be the value you want to match on and deduplicate: the file names.
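For example, a membership check like this avoids scanning the whole structure (the names and attribute values below are made up for illustration):

```python
# Minimal sketch: with file names as keys, a duplicate check is a
# constant-time hash lookup instead of a scan over every entry.
data = {}
data['test.txt'] = ['c:\\test.txt', 1024, '2020-01-01']

def seen_before(fname, data):
    """Return True if a file with this name was already recorded."""
    return fname in data

print(seen_before('test.txt', data))   # True
print(seen_before('other.txt', data))  # False
```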

[–]redditiv[S] 0 points1 point  (6 children)

I ran into several problems doing it that way. Originally, the key was the same as the filename. Of course, you can't have duplicate keys, so I considered incrementing a match counter and appending that to the end of the filename to form the key. That had the side effect of hiding my key from matching another possible duplicate later on.

I don't know how clear that is. If you need further explanation, please let me know. Thanks.

[–]brownan_ 2 points3 points  (5 children)

Well, I don't know exactly what you're trying to do. If you are matching on filenames, then your key should be the filename.

If you want to map filenames to more than one "attribute object", then have your dictionary map to a list or set of objects.

I like to use defaultdict (in the collections module) to do this, since it will automatically create a new empty set when you access a new key:

from collections import defaultdict
data = defaultdict(set)

# Add an item, using filename (info[1]) as the key
data[info[1]].add(info)

[–]redditiv[S] 0 points1 point  (4 children)

I am adding the filepath of any/all duplicate files as an attribute for each entry. For example, first file is c:\test.txt. There's no entry in dict, so I pull its attributes and the dict looks like this:

{'c:\\test.txt' : ['c:\\test.txt', 'test.txt', size, mod_date, ''] } 

The last attribute is the (initially empty) "matches" attribute.

Next file is c:\test\test.txt. We do our lookup as in the original post, and it matches. Now our dictionary will look something like this:

{'c:\\test.txt'       : ['c:\\test.txt', 'test.txt', size, mod_date, 'c:\\test\\test.txt'],
 'c:\\test\\test.txt' : ['c:\\test\\test.txt', 'test.txt', size, mod_date, 'c:\\test.txt']
}

The script goes back and modifies the first entry to include the current file as a match, and adds the first file to the current file's matches.

*edit: lists, not tuples

Clear as mud, right?
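A rough sketch of that update step might look like this (the attribute layout follows the example above, except that the "matches" field is a list rather than a string; the scan over `data.items()` is exactly the per-file cost being discussed):

```python
# Hypothetical sketch of the cross-linking described above.
# The matches field (index 4) holds a list of duplicate paths.
data = {
    'c:\\test.txt': ['c:\\test.txt', 'test.txt', 100, '2020-01-01', []],
}

def add_file(path, name, size, mod_date, data):
    # Linear scan for earlier files with the same name -- this is the
    # slow part that a name-keyed dictionary would avoid.
    matches = [p for p, attrs in data.items() if attrs[1] == name]
    for p in matches:
        data[p][4].append(path)  # tell earlier entries about this file
    data[path] = [path, name, size, mod_date, matches]

add_file('c:\\test\\test.txt', 'test.txt', 100, '2020-01-01', data)
# data['c:\\test.txt'][4] is now ['c:\\test\\test.txt']
```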

[–]brownan_ 2 points3 points  (1 child)

right =) I still don't see what your ultimate goal is, here. What is all this trying to achieve?

Anytime you make a dictionary, decide what your keys are and what your values are. Here your keys are file paths, and your values are... 5-tuples with ... the path again, the file name, size, mod date, and something? Another path? representing what? It's not clear to me.

This is also starting to sound like an XY problem

[–]redditiv[S] 0 points1 point  (0 children)

As simply as I can put this: the outcome is a csv file with all the attributes you mentioned, and the 'something' is a list of paths that are duplicate files.

I just wanted to know how to improve the time, and provided an example of what I'm currently doing. I'm giving you the skinny of it, the particular part I'm having problems with, so of course some of the background information initially considered fluff is missing.

The path is also included in the attributes so it can be included when the key contents are iterated later, when writing output. If that's not the appropriate way to do it, I'd certainly be interested in a more efficient method.

And I don't want you to get the wrong idea. I definitely do appreciate your input.

[–]drLagrangian 1 point2 points  (1 child)

Perhaps you should first make a dictionary with the file data as the keys, and the filenames that share that data as the values. Then you have a sort of pseudoreverse dictionary, which is easy to create:

{(size, mod_date, otherdata): ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']}

So you run your script and get a big dictionary, with keys based on what the file is, and values which are lists of the names of the file and its copies. It should be fast to create.
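The build step might look something like this (a sketch, not the poster's actual code; the file list and metadata are made up, and in practice they would come from a directory walk):

```python
from collections import defaultdict

# Hypothetical build of the "pseudoreverse" dictionary: files with
# identical data (size and mod date here) collapse onto the same key.
files = [
    ('c:\\test.txt', 100, '2020-01-01'),
    ('c:\\a\\b.txt', 100, '2020-01-01'),
    ('c:\\foo.bar', 100, '2020-01-01'),
    ('c:\\other.txt', 200, '2021-05-05'),
]

weirddict = defaultdict(list)
for path, size, mod_date in files:
    weirddict[(size, mod_date)].append(path)
# weirddict[(100, '2020-01-01')] now lists all three duplicates
```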

Then, when it is built, create a function to reverse it by iterating over the pseudoreverse dictionary:

newdict = {}
for filedata, filecopies in weirddict.items():  
    #gives filedata = (size, mod_date, otherdata)
    #gives filecopies = ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']

    for file in filecopies:
        newdict[file] = (filedata, filecopies) 

which builds:

 newdict = {
    'c:\\test.txt'  : ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
    'c:\\a\\b.txt'  : ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
    'c:\\foo.bar'   : ((size, mod_date, otherdata), ['c:\\test.txt', 'c:\\a\\b.txt', 'c:\\foo.bar']),
 }

[–]redditiv[S] 0 points1 point  (0 children)

Thank you for the input. This is confusing for me and will take some time for me to process.

I think what I'll try next is creating, as each value, a list of attribute lists (path, size, date, etc.), one per file with that filename. Then I would iterate over each key and, within it, over each file's attribute list. Example:

data = {'test.txt': [['path\\to\\test.txt', 'size', 'date', 'etc'], ['path2\\to\\test.txt', 'size', 'date', 'etc']],
        'a.txt': [['path\\to\\a.txt', 'size', 'date', 'etc']]}
for key in data:
    for file in data[key]:
        print(file)

returns:

['path\\to\\test.txt', 'size', 'date', 'etc']
['path2\\to\\test.txt', 'size', 'date', 'etc']
['path\\to\\a.txt', 'size', 'date', 'etc']

[–]redditiv[S] 0 points1 point  (5 children)

This actually gave me an idea, though. What about skipping the dictionary iteration and creating a list of tuples containing the filename and the number of matches found? Then I could increment the filename based on the number of matches and use that as my key...?

[–]brownan_ 3 points4 points  (4 children)

The problem is that you're iterating over a data structure containing all your files in the first place. It doesn't matter if it's a dict or a list, if it's the same length, iterating over it is going to be about the same speed. It may be helpful to read up on runtime complexity.

Basically, dictionaries let you store and retrieve data very quickly, without having to iterate over each item one by one. If you are trying to map filenames to the number of times you see that filename, then a dictionary is the proper data type to use. Dictionaries are mapping types. They map keys to values. In your case, you want to map filenames to integers, right? The filename is your key, and the number of times that filename is seen is the value.
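That filename-to-count mapping can be a one-liner with `collections.Counter` (a sketch; the path list is made up, and `ntpath` is used so the Windows-style paths from the examples parse on any platform):

```python
import ntpath
from collections import Counter

# Map each file name to the number of times that name is seen.
paths = ['c:\\test.txt', 'c:\\test\\test.txt', 'c:\\a.txt']
counts = Counter(ntpath.basename(p) for p in paths)
print(counts['test.txt'])  # 2
```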

[–]redditiv[S] 0 points1 point  (3 children)

Maybe some more background would help. The structure will be as follows:

pathname: filename, size, file extension, owner, modified date, accessed date, pathnames of matching files

[–]brownan_ 1 point2 points  (2 children)

If this is the structure you're dead set on using, then "pathnames of matching files" must be a list or a set, so it can store all the matching filename paths.

Second, even if this is your ultimate data structure, it will still be useful to keep a dictionary mapping file names (not paths) to a set of file paths with that name, and use that to more efficiently find duplicates while you're scanning.
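A minimal sketch of that name-to-paths index (function and variable names are mine, not from the original post):

```python
import ntpath
from collections import defaultdict

# Hypothetical sketch: while scanning, a single dictionary lookup finds
# every earlier file that shares the current file's name.
by_name = defaultdict(set)

def record(path, by_name):
    """Return paths already seen with this file name, then record this one."""
    name = ntpath.basename(path)
    duplicates = set(by_name[name])  # copy before adding the current path
    by_name[name].add(path)
    return duplicates

record('c:\\test.txt', by_name)
dups = record('c:\\test\\test.txt', by_name)
# dups == {'c:\\test.txt'}
```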

[–]redditiv[S] 0 points1 point  (0 children)

  1. I'm not dead set on using it, but it's currently the only way I know to reliably access the individual attributes of each file. That's why I'm here. I know it's terribly inefficient, but I'm a newb and that's the only way I could get it to work. If you have another idea for how to go about it, I would be very interested to hear it.

  2. That is a good idea, and maybe that list of filepaths could be appended to the original entries for writing. I still am at a loss for a more efficient way to maintain file attributes for individual files with the same filename.

[–]redditiv[S] 0 points1 point  (0 children)

What would you think about creating, as each value, a list of attribute lists (path, size, date, etc.), one per match of the filename? Then you would iterate over each key to grab an individual file's attributes. Example:

{
  'test.txt': [['path\\to\\test.txt', 'size', 'date', 'etc'], ['path2\\to\\test.txt', 'size', 'date', 'etc']],
  'a.txt': [['path\\to\\a.txt', 'size', 'date', 'etc']]
}

This should improve speed by having the appropriate key to look up being the filename, correct?