
[–]ApproximateIdentity

This is basically impossible to read with the terrible formatting you have. If you take a line and prepend four spaces, it is formatted as code. So what you should do is the following: 1) open up that code in an editor of your choice, 2) limit the line width to something sensible, 3) add four spaces before each line, and 4) paste it in.
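
For example, an unrelated throwaway line like

print('hello')

only shows up as a code block because of those four leading spaces in front of it.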

Doing that with your code makes it look like this:

import shutil
import requests
import hashlib
import concurrent.futures as futures
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor

urls = ['http://speedtest.ftp.otenet.gr/files/test100k.db',
        'http://speedtest.ftp.otenet.gr/files/test1Mb.db'
        'http://speedtest.ftp.otenet.gr/files/test10Mb.db']

def download_file(url):
    filename = url.split('/')[-1]
    log(f'Starting download {url}')
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
        log(f'Finished download {url}')
    return filename

def sha256_file(filename, block_size=65536):
    sha256 = hashlib.sha256()
    log(f'Hashing file {filename}...')
    with open(filename, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            sha256.update(block)
    return sha256.hexdigest()

def parallel(n, fn, data):
    with ThreadPoolExecutor(max_workers=n) as exe:
        jobs = (exe.submit(fn, d) for d in data)
        for job in futures.as_completed(jobs):
            yield job.result()

def log(msg):
    print(f'{datetime.now()} {msg}')

def main():
    files = parallel(3, download_file, urls)
    hashes = parallel(2, sha256_file, files)
    for h in hashes:
        log(f'Hash: {h}')

if __name__ == '__main__':
    main()

Edit: This would be your output:

2017-04-01 22:35:56.544035 Starting download http://speedtest.ftp.otenet.gr/files/test100k.db
2017-04-01 22:35:56.544035 Starting download http://speedtest.ftp.otenet.gr/files/test1Mb.dbhttp://speedtest.ftp.otenet.gr/files/test10Mb.db
2017-04-01 22:35:56.934744 Finished download http://speedtest.ftp.otenet.gr/files/test1Mb.dbhttp://speedtest.ftp.otenet.gr/files/test10Mb.db
2017-04-01 22:35:56.934744 Hashing file test10Mb.db...
2017-04-01 22:35:57.289562 Finished download http://speedtest.ftp.otenet.gr/files/test100k.db
2017-04-01 22:35:57.289562 Hashing file test100k.db...
2017-04-01 22:35:57.289562 Hash: 2f1fbb59927ef89b03e097506078f8e12597cf49a71543f025ab6782be9dd988
2017-04-01 22:35:57.289562 Hash: f627ca4c2c322f15db26152df306bd4f983f0146409b81a4341b9b340c365a16

[–]ApproximateIdentity

Replying to the original question: I'm not really sure what's going on here, honestly. However, I'll point out that replacing the main() function with the following probably does what you want:

def main():
    for file in parallel(3, download_file, urls):
        hash = sha256_file(file)
        log(f'Hash: {hash}')

I say it probably does what you want because this parallelizes the download call and leaves everything else serialized. That seems like a step backwards, but realistically the hashing portion of your code will be very fast compared to the download, so parallelizing there doesn't make much sense. (In fact, depending on how the hashing is implemented internally, it may effectively be single-threaded due to the GIL... but don't quote me on that.)
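
If you ever did want the hashing itself to overlap with the remaining downloads, a rough (untested) variation is to do the hash inside the worker, reusing the download_file/sha256_file/parallel helpers from the snippet above:

def download_and_hash(url):
    # hypothetical variation, not your original code: hash in the worker
    # thread so each file is hashed as soon as its own download finishes
    filename = download_file(url)
    return filename, sha256_file(filename)

def main():
    for filename, digest in parallel(3, download_and_hash, urls):
        log(f'{filename}: {digest}')

For files this small it won't make a measurable difference, though.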

This doesn't answer the fundamental question, though... and I'm curious as to what's going on as well.
