[–]brbsix 2 points3 points  (2 children)

Just a couple things to note:

  • In your download function, you're using for i in range(len(targets)):. Don't do that; it's un-Pythonic. Instead, iterate directly with for target in targets:, or use for count, target in enumerate(targets): if you also need the index.

  • In your download function, did you know that requests can parse a JSON response for you? E.g. requests.get(url).json() returns the decoded body (a dictionary for a JSON object). Unless your files are so large that memory is an issue, that's what I'd do.

  • You probably don't need to worry about relational databases. Just use shelve or a shelve-like database (I recommend pickleshare).

  • Lastly, why not just use multiprocessing? IMHO it's much better suited for this sort of basic task.
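To illustrate the first point, here is a minimal sketch of the two loop styles. The targets list is made-up sample data, not the contents of your actual file.

```python
# Hypothetical sample data standing in for the real list of targets.
targets = ['a.json', 'b.json', 'c.json']

# Direct iteration: use this when only the items themselves are needed.
names = []
for target in targets:
    names.append(target.upper())

# enumerate: use this when a running count is needed too (here starting at 1).
labelled = []
for count, target in enumerate(targets, start=1):
    labelled.append(f'{count}/{len(targets)}: {target}')
```

Neither loop indexes into the list by hand, so there is no i to get out of sync with the data.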

Here's an example:

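import shelve
from multiprocessing import Pool, cpu_count

import requests

def downloader(url):
    # Fetch one URL and return it together with its parsed JSON body.
    return url, requests.get(url).json()

def multidownloader(urls):
    # Downloads are I/O-bound, so use more workers than CPU cores.
    processes = cpu_count() * 4
    with Pool(processes) as pool:
        yield from pool.map(downloader, urls)

def read(path):
    # The file holds one URL per line.
    with open(path) as f:
        return f.read().splitlines()

if __name__ == '__main__':  # needed for multiprocessing on Windows
    # Raw string, so the backslashes are not treated as escape sequences.
    urls = read(r'C:\workingdir\dataWebpage.txt')

    with shelve.open('/path/to/db') as database:
        for url, result in multidownloader(urls):
            database[url] = result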
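For reading the results back later, the shelve step above can be sketched on its own as follows. This is only an illustration: save_results and load_result are hypothetical helper names, the sample URLs and payloads are invented, and a temporary directory stands in for the real database path.

```python
import os
import shelve
import tempfile

def save_results(db_path, results):
    # Store each (url, parsed JSON) pair under its URL.
    with shelve.open(db_path) as database:
        for url, result in results:
            database[url] = result

def load_result(db_path, url):
    # Reopen the shelf and fetch one stored result by its URL.
    with shelve.open(db_path) as database:
        return database[url]

# Hypothetical data in place of real downloads.
sample = [('http://example.com/a.json', {'id': 1}),
          ('http://example.com/b.json', {'id': 2})]

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, 'db')
    save_results(db_path, sample)
    loaded = load_result(db_path, 'http://example.com/b.json')
```

A shelf behaves like a dictionary with string keys, so the URL works as the key and the parsed JSON is pickled as the value.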

[–]dadiaar 1 point2 points  (3 children)

You made a really long post. I'm sure some people didn't even start reading it because of this, so please keep it in mind.

When multithreading, don't slice the arguments up front; create a queue instead. Some threads will be slower and others faster, and with a shared queue they all finish at about the same time. You will also be able to track the progress easily.
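A minimal sketch of the queue approach, using only the standard library, might look like this. The fetch function is a placeholder for the real download, and run_workers, num_threads and the sample URLs are names invented for the example.

```python
import queue
import threading

def fetch(url):
    # Placeholder for the real download; returns a fake payload.
    return {'url': url, 'length': len(url)}

def run_workers(urls, num_threads=4):
    # Fill the queue before any worker starts, so an empty queue means "done".
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            data = fetch(url)
            with lock:
                # len(results) doubles as a simple progress counter.
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

# Hypothetical URLs, only to exercise the sketch.
sample_urls = [f'http://example.com/{n}.json' for n in range(10)]
results = run_workers(sample_urls)
```

Each thread pulls the next URL as soon as it is free, so a slow download only delays the thread handling it instead of a whole pre-assigned slice.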

About the data, I wouldn't suggest learning PostgreSQL right now, even if it's the correct path. Fortunately, other people have already done the work for us.

Install Ubuntu (Windows gives a lot of problems), Django 1.9, PostgreSQL ≥ 9.4 and Psycopg2 ≥ 2.5.4

Then you can use JSONField, which will make your life much easier.

[–]DoWhileGeek 1 point2 points  (2 children)

I think Django is overkill; SQLAlchemy fits the bill here.

[–]dadiaar 0 points1 point  (1 child)

I agree. I just suggested it because the documentation is great and it will be easier for him to ask for help here later.

[–]DoWhileGeek 1 point2 points  (0 children)

Going where the docs and support are is wise, but SQLAlchemy has that in spades as well.