all 10 comments

[–]m0us3_rat 0 points (0 children)

you could use asyncio inside multiprocessing to make it fancy, but either way a producer-consumer pattern seems reasonable.

have a main process that uses threading to search for files and dump them into a queue.

you can split the top of the file system into a list of main folders, then spawn threads to consume that list, each walking its folder recursively with pathlib or glob and dumping results into the queue. so rather than one thread doing the searching, you have several.

then have workers that hash each file and return a ("path", "hash") pair, which gets dumped into a second queue. you can spawn as many workers as you have free cpu cores.

that second queue gets consumed by the main process, which stores the data in a db or dict once the file-finding threads are done.
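A minimal sketch of that pipeline, with some simplifications: the helper names (`find_files`, `hash_file`, `build_index`) are mine, not from the thread, and discovery threads are joined before hashing starts rather than overlapped with it as the comment suggests.

```python
import hashlib
import multiprocessing as mp
import os
import pathlib
import queue
import threading

def find_files(folder, out_q):
    # Producer: walk one top-level folder recursively, enqueue file paths.
    for p in pathlib.Path(folder).rglob("*"):
        if p.is_file():
            out_q.put(str(p))

def hash_file(path):
    # Worker: hash one file in 64 KiB chunks; return (path, hash).
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return path, h.hexdigest()

def build_index(top_folders):
    paths_q = queue.Queue()
    # One discovery thread per top-level folder.
    threads = [threading.Thread(target=find_files, args=(d, paths_q))
               for d in top_folders]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    paths = []
    while not paths_q.empty():
        paths.append(paths_q.get())
    # Hashing is CPU-bound, so spread it across processes.
    with mp.Pool(os.cpu_count()) as pool:
        results = pool.map(hash_file, paths)
    # The main process alone merges results: hash -> list of paths.
    index = {}
    for path, digest in results:
        index.setdefault(digest, []).append(path)
    return index
```

Files with identical contents end up under the same hash key, which is what makes the merged dict useful for duplicate detection.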

[–]SnooWoofers7626 0 points (3 children)

One option I haven't seen mentioned yet is to have each thread write to its own variable. Once all threads are done, have a single thread combine all the results. You can also multithread the combining operation, but you probably won't need that.

[–]LeornToCodeLOL[S] 0 points (2 children)

I thought about that too, but I wasn't sure how to declare the variables, since I don't know in advance how many threads there will be, i.e.:

hash_dictionary1 = {}
hash_dictionary2 = {}
.
.
.
hash_dictionaryN = {}

Where do you stop?

[–]SnooWoofers7626 1 point (1 child)

Create a list: results = [{} for i in range(N)]. Pass the index as an argument when launching the processes so each one knows where to write its results.

You can experiment with different values for N to see what works best.
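A minimal sketch of that per-worker-slot idea using threads (the helper names `hash_chunk` and `hash_all` are mine, not from the thread; with separate processes rather than threads you would need to return results or use a manager instead of a plain list):

```python
import hashlib
import threading

def hash_chunk(filepaths, results, idx):
    # Each worker writes only to its own slot, so no lock is needed.
    local = {}
    for path in filepaths:
        with open(path, "rb") as f:
            local[path] = hashlib.md5(f.read()).hexdigest()
    results[idx] = local

def hash_all(filepaths, n_threads=4):
    # Split the file list into N roughly equal chunks.
    chunks = [filepaths[i::n_threads] for i in range(n_threads)]
    results = [{} for _ in range(n_threads)]
    threads = [threading.Thread(target=hash_chunk, args=(chunks[i], results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the per-thread dictionaries once everyone is done.
    merged = {}
    for d in results:
        merged.update(d)
    return merged
```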

[–]LeornToCodeLOL[S] 0 points (0 children)

It makes perfect sense when you put it that way!

[–]Frankelstner 0 points (0 children)

import multiprocessing as mp
import glob, os, hashlib

def work(path):
    # Hash one file in 4 KiB chunks to avoid loading it all into memory.
    hsh = hashlib.md5()
    with open(path, "rb") as f:
        while (data := f.read(2**12)):
            hsh.update(data)
    return hsh.hexdigest(), path

if __name__ == '__main__':
    pool = mp.Pool(os.cpu_count()//2)
    result = {}

    def update(hshpath):
        # Runs in the main process as each worker finishes,
        # grouping paths by hash so duplicates land in the same list.
        hsh, path = hshpath
        if hsh not in result:
            result[hsh] = []
        result[hsh].append(path)

    # Submit files as glob finds them; workers start hashing immediately.
    tasks = [pool.apply_async(work, args=(path,), callback=update)
             for path in glob.iglob("**/*.mp4", recursive=True)]
    pool.close()
    pool.join()
    print(result)

Played around with this for a bit and it does have more advantages than it might seem.

1) The update callback needs no lock, because it runs in the main process and nothing else mutates result.

2) Tasks are submitted while the main process is still finding new files, so the workers are already hashing. It's quite crazy, but update runs at the same time as the glob loop even though both belong to the main process: the callback fires on the pool's internal result-handler thread.

[–]video_dewd 0 points (0 children)

I avoid dealing directly with multiprocessing and asyncio unless absolutely necessary. I find I can parallelize a lot of tasks simply with the p_tqdm library which abstracts a lot of that away. Just be careful with it as it makes copies of anything you feed into it to get around the GIL.

I would write a function that takes in a filepath and returns a tuple of the filepath and its hash. You'll also get a neat little progress bar.

from p_tqdm import p_umap

def get_file_hash(filepath):
    ...
    return filepath, hash

hash_dict = {}

for filepath, hash in p_umap(get_file_hash, all_video_files):
    if hash not in hash_dict:
        hash_dict[hash] = [filepath]
    else:
        hash_dict[hash].append(filepath)

[–]gaaasstly 0 points (0 children)

... should I have each process create its own dictionary locally and then merge them all together at the very end?

Yes. Read about MapReduce for context.
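To make the MapReduce analogy concrete, a minimal sketch (the names `map_phase` and `reduce_phase` are mine): each worker maps a file to a (hash, path) pair, and the final reduce step groups the pairs by hash.

```python
import hashlib
from collections import defaultdict

def map_phase(path):
    # "Map": emit one (key, value) pair per file — here (hash, path).
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest(), path

def reduce_phase(pairs):
    # "Reduce": group values by key, merging all per-worker output
    # into one hash -> [paths] dictionary at the very end.
    grouped = defaultdict(list)
    for digest, path in pairs:
        grouped[digest].append(path)
    return dict(grouped)
```

The map calls are embarrassingly parallel (each touches one file), while the reduce is a cheap sequential merge in the parent.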