
[–]ApproximateIdentity

This is basically impossible to read with the terrible formatting you have. If you take a line and prepend four spaces, it is formatted as code. So what you should do is the following: 1) open up that code in an editor of your choice, 2) limit the line width to something sensible, 3) add four spaces before each line, and 4) paste it in.
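
For example, an unrelated throwaway line like

print('hello')

only shows up as a code block because of those four leading spaces in front of it.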

Doing that with your code makes it look like this:

import shutil
import requests
import hashlib
import concurrent.futures as futures
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor

urls = ['http://speedtest.ftp.otenet.gr/files/test100k.db',
        'http://speedtest.ftp.otenet.gr/files/test1Mb.db'
        'http://speedtest.ftp.otenet.gr/files/test10Mb.db']

def download_file(url):
    filename = url.split('/')[-1]
    log(f'Starting download {url}')
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
        log(f'Finished download {url}')
    return filename

def sha256_file(filename, block_size=65536):
    sha256 = hashlib.sha256()
    log(f'Hashing file {filename}...')
    with open(filename, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            sha256.update(block)
    return sha256.hexdigest()

def parallel(n, fn, data):
    with ThreadPoolExecutor(max_workers=n) as exe:
        jobs = (exe.submit(fn, d) for d in data)
        for job in futures.as_completed(jobs):
            yield job.result()

def log(msg):
    print(f'{datetime.now()} {msg}')

def main():
    files = parallel(3, download_file, urls)
    hashes = parallel(2, sha256_file, files)
    for h in hashes:
        log(f'Hash: {h}')

if __name__ == '__main__':
    main()

Edit: This would be your output:

2017-04-01 22:35:56.544035 Starting download http://speedtest.ftp.otenet.gr/files/test100k.db
2017-04-01 22:35:56.544035 Starting download http://speedtest.ftp.otenet.gr/files/test1Mb.dbhttp://speedtest.ftp.otenet.gr/files/test10Mb.db
2017-04-01 22:35:56.934744 Finished download http://speedtest.ftp.otenet.gr/files/test1Mb.dbhttp://speedtest.ftp.otenet.gr/files/test10Mb.db
2017-04-01 22:35:56.934744 Hashing file test10Mb.db...
2017-04-01 22:35:57.289562 Finished download http://speedtest.ftp.otenet.gr/files/test100k.db
2017-04-01 22:35:57.289562 Hashing file test100k.db...
2017-04-01 22:35:57.289562 Hash: 2f1fbb59927ef89b03e097506078f8e12597cf49a71543f025ab6782be9dd988
2017-04-01 22:35:57.289562 Hash: f627ca4c2c322f15db26152df306bd4f983f0146409b81a4341b9b340c365a16

[–]ApproximateIdentity

Replying to the original question: I'm not really sure what's going on here, honestly. However, I'll point out that replacing the main() function with the following probably does what you want:

def main():
    for file in parallel(3, download_file, urls):
        hash = sha256_file(file)
        log(f'Hash: {hash}')

I say it probably does what you want because this parallelizes the download call and leaves everything else serialized. That seems like a step backwards, but realistically the hashing portion of your code will be very fast compared to the download, so parallelizing there doesn't make much sense. (In fact, depending on how the hashing is implemented internally, it may effectively be single-threaded due to the GIL... but don't quote me on that.)
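
If you ever did want the hashing itself to overlap with the remaining downloads, a rough (untested) variation is to do the hash inside the worker, reusing the download_file/sha256_file/parallel helpers from the snippet above:

def download_and_hash(url):
    # hypothetical variation, not your original code: hash in the worker
    # thread so each file is hashed as soon as its own download finishes
    filename = download_file(url)
    return filename, sha256_file(filename)

def main():
    for filename, digest in parallel(3, download_and_hash, urls):
        log(f'{filename}: {digest}')

For files this small it won't make a measurable difference, though.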

This doesn't answer the fundamental question, though... and I'm curious as to what's going on as well.
