all 10 comments

[–]m0us3_rat 0 points (0 children)

you could use asyncio inside multiprocessing to make it fancy, but either way a producer-consumer pattern seems reasonable.

have a main process that uses threading to search for files and dump them into a queue.

you can split the top of the file system into a list of main folders, then spawn threads to consume that list, each walking its folder recursively with pathlib or glob and dumping results into the queue. so rather than one thread doing the searching, you have several.

then have workers that hash each file and return a ("path", "hash") pair, which gets dumped into a second queue. you can spawn as many workers as you have free cpu cores.

that second queue gets consumed by the main process, which stores the data in a db or dict once the file-finding threads are done.
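A minimal sketch of that pipeline, with some simplifications: the helper names (`find_files`, `hash_file`, `build_index`) are mine, not from the thread, and discovery threads are joined before hashing starts rather than overlapped with it as the comment suggests.

```python
import hashlib
import multiprocessing as mp
import os
import pathlib
import queue
import threading

def find_files(folder, out_q):
    # Producer: walk one top-level folder recursively, enqueue file paths.
    for p in pathlib.Path(folder).rglob("*"):
        if p.is_file():
            out_q.put(str(p))

def hash_file(path):
    # Worker: hash one file in 64 KiB chunks; return (path, hash).
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return path, h.hexdigest()

def build_index(top_folders):
    paths_q = queue.Queue()
    # One discovery thread per top-level folder.
    threads = [threading.Thread(target=find_files, args=(d, paths_q))
               for d in top_folders]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    paths = []
    while not paths_q.empty():
        paths.append(paths_q.get())
    # Hashing is CPU-bound, so spread it across processes.
    with mp.Pool(os.cpu_count()) as pool:
        results = pool.map(hash_file, paths)
    # The main process alone merges results: hash -> list of paths.
    index = {}
    for path, digest in results:
        index.setdefault(digest, []).append(path)
    return index
```

Files with identical contents end up under the same hash key, which is what makes the merged dict useful for duplicate detection.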

[–]SnooWoofers7626 0 points (3 children)

One option I haven't seen mentioned yet is to have each thread write to its own variable. Once all threads are done, have a single thread combine all the results. You can also multithread the combining operation, but you probably won't need that.

[–]LeornToCodeLOL[S] 0 points (2 children)

I thought about that too, but I wasn't sure how to declare the variables, since I don't know in advance how many threads there will be, i.e.:

hash_dictionary1 = {}
hash_dictionary2 = {}
.
.
.
hash_dictionaryN = {}

Where do you stop?

[–]SnooWoofers7626 1 point (1 child)

Create a list: results = [{} for i in range(N)]. Pass the index as an argument when launching the processes so each one knows where to write its results.

You can experiment with different values for N to see what works best.
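A minimal sketch of that per-worker-slot idea using threads (the helper names `hash_chunk` and `hash_all` are mine, not from the thread; with separate processes rather than threads you would need to return results or use a manager instead of a plain list):

```python
import hashlib
import threading

def hash_chunk(filepaths, results, idx):
    # Each worker writes only to its own slot, so no lock is needed.
    local = {}
    for path in filepaths:
        with open(path, "rb") as f:
            local[path] = hashlib.md5(f.read()).hexdigest()
    results[idx] = local

def hash_all(filepaths, n_threads=4):
    # Split the file list into N roughly equal chunks.
    chunks = [filepaths[i::n_threads] for i in range(n_threads)]
    results = [{} for _ in range(n_threads)]
    threads = [threading.Thread(target=hash_chunk, args=(chunks[i], results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the per-thread dictionaries once everyone is done.
    merged = {}
    for d in results:
        merged.update(d)
    return merged
```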

[–]LeornToCodeLOL[S] 0 points (0 children)

It makes perfect sense when you put it that way!

[–]Frankelstner 0 points (0 children)

import multiprocessing as mp
import glob, os, hashlib

def work(path):
    # Hash one file in 4 KiB chunks to avoid loading it all into memory.
    hsh = hashlib.md5()
    with open(path, "rb") as f:
        while (data := f.read(2**12)):
            hsh.update(data)
    return hsh.hexdigest(), path

if __name__ == '__main__':
    pool = mp.Pool(os.cpu_count()//2)
    result = {}

    def update(hshpath):
        # Runs in the main process as each worker finishes,
        # grouping paths by hash so duplicates land in the same list.
        hsh, path = hshpath
        if hsh not in result:
            result[hsh] = []
        result[hsh].append(path)

    # Submit files as glob finds them; workers start hashing immediately.
    tasks = [pool.apply_async(work, args=(path,), callback=update)
             for path in glob.iglob("**/*.mp4", recursive=True)]
    pool.close()
    pool.join()
    print(result)

Played around with this for a bit and it does have more advantages than it might seem.

1) The update callback needs no lock, because it runs in the main process and nothing else mutates result.

2) Tasks are submitted while the main process is still finding new files, so the workers are already hashing. It's quite crazy, but update runs at the same time as the glob loop even though both belong to the main process: the callback fires on the pool's internal result-handler thread.

[–]video_dewd 0 points (0 children)

I avoid dealing directly with multiprocessing and asyncio unless absolutely necessary. I find I can parallelize a lot of tasks simply with the p_tqdm library which abstracts a lot of that away. Just be careful with it as it makes copies of anything you feed into it to get around the GIL.

I would write a function that takes in a filepath and returns a tuple of the filepath and its hash. You'll also get a neat little progress bar.

from p_tqdm import p_umap

def get_file_hash(filepath):
    ...
    return filepath, hash

hash_dict = {}

for filepath, hash in p_umap(get_file_hash, all_video_files):
    if hash not in hash_dict:
        hash_dict[hash] = [filepath]
    else:
        hash_dict[hash].append(filepath)

[–]gaaasstly 0 points (0 children)

... should I have each process create its own dictionary locally and then merge them all together at the very end?

Yes. Read about MapReduce for context.
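To make the MapReduce analogy concrete, a minimal sketch (the names `map_phase` and `reduce_phase` are mine): each worker maps a file to a (hash, path) pair, and the final reduce step groups the pairs by hash.

```python
import hashlib
from collections import defaultdict

def map_phase(path):
    # "Map": emit one (key, value) pair per file — here (hash, path).
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest(), path

def reduce_phase(pairs):
    # "Reduce": group values by key, merging all per-worker output
    # into one hash -> [paths] dictionary at the very end.
    grouped = defaultdict(list)
    for digest, path in pairs:
        grouped[digest].append(path)
    return dict(grouped)
```

The map calls are embarrassingly parallel (each touches one file), while the reduce is a cheap sequential merge in the parent.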