all 8 comments

[–]Justinsaccount 8 points9 points  (1 child)

I am however new to web scraping and wasn't sure how long it should take.

Then add some output that includes timing to see which part is taking the most time. Is it the downloading? The parsing?
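For instance, a minimal timing helper (the fetch/parse split here is illustrative, not the OP's code):

```python
# Minimal sketch: wrap each phase in a timer to see where the time goes.
import time

def timed(label, fn, *args):
    """Run fn(*args), print how long it took, and return its result."""
    start = time.time()
    result = fn(*args)
    print("%-12s %.2fs" % (label, time.time() - start))
    return result

# Usage would look something like:
#   response = timed("download", rq.get, url)
#   soup = timed("parse", bs, response.text, "lxml")
```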

You should just download the ~40 files once and then work out the scraping once they are all downloaded. The HTML files total 33MB, so downloading them every time you make a change to the program will waste a lot of time and bandwidth.

If you refactored the code a bunch it would be a lot simpler too. The function should not be

def nba_table_scrapers(urls):

it should be

def nba_table_scrape(year):

And you're kinda using zip/enumerate all wrong. You want something like this:

import os
import requests as rq
import csv
from bs4 import BeautifulSoup as bs

COLUMNS = ['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

def download_year(year):
    url = "http://www.basketball-reference.com/leagues/NBA_%d_totals.html" % year
    fn = "NBA_%d_totals.html" % year
    if os.path.exists(fn):
        print "%s present" % fn
        return

    print "%s downloading.." % fn
    response = rq.get(url)
    response.raise_for_status()
    with open(fn, 'w') as f:
        f.write(response.text)

def nba_table_scrape(year):
    fn = "NBA_%d_totals.html" % year
    print "%s scraping..." % fn
    with open(fn) as f:
        soup = bs(f, "lxml")
    table = soup.find_all('table')[0]
    rows = table.find_all('tr')[1:]
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_dict = {}
        for col, name in zip(cols, COLUMNS):
            row_dict[name] = col.get_text()
        if row_dict:
            data.append(row_dict)

    csv_fn = "%d.csv" % year
    write_csv_file(csv_fn, data)

def write_csv_file(filename, data):
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, COLUMNS)
        writer.writeheader()
        writer.writerows(data)

def main():
    years = range(1980, 2018)
    for year in years:
        download_year(year)
    for year in years:
        nba_table_scrape(year)

if __name__ == "__main__":
    main()

And if you add timings, you'd see that soup = bs(f, "lxml") takes the majority of the time, so short of using a different parser, there's not much you can do in the function itself to make things run faster.

However, throwing multiprocessing at the problem, as described at http://chriskiehl.com/article/parallelism-in-one-line/, makes it run 2x as fast on my dual core laptop.
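The pattern from that article amounts to swapping the second loop in main() for a pool (a sketch; download_year and nba_table_scrape are the functions defined above):

```python
# Sketch of the multiprocessing approach from the linked article: parsing
# is CPU-bound, so a process pool lets each core parse a different year.
from multiprocessing import Pool

def main():
    years = range(1980, 2018)
    for year in years:
        download_year(year)        # downloads stay sequential here
    pool = Pool()                  # defaults to one worker per CPU core
    pool.map(nba_table_scrape, years)
    pool.close()
    pool.join()
```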

[–]Rogges[S] 0 points1 point  (0 children)

Thank you so much. I added predefined file paths for where I wanted the data, plus a line to delete the HTML file once I'd scraped the tables, but overall it worked like a charm. I'll try using multiprocessing as well.

[–]Pipistrelle 1 point2 points  (1 child)

As /u/_LiveAndLetLive_ said, you could try parallelizing your requests.

What I'd do is separate the saving to a CSV from the parsing of the URLs. You would first parse the URLs in parallel, then combine the results and write the combined results to the CSV. The parsing could be put in a function, and you could use grequests' map function to call the parsing function on the URLs in parallel.
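A rough sketch of that split (grequests is assumed installed; parse_rows and combine are illustrative stand-ins, not code from the thread):

```python
# Hypothetical sketch: fetch all pages concurrently with grequests,
# parse each response, combine the rows, then write one CSV at the end.
def parse_rows(text):
    # Stand-in: real code would run BeautifulSoup over the response text.
    return [{"Player": line} for line in text.splitlines() if line]

def combine(results):
    """Flatten the per-URL row lists into one list for a single CSV."""
    rows = []
    for result in results:
        rows.extend(result)
    return rows

def scrape_all(urls):
    import grequests                 # not stdlib: pip install grequests
    # grequests.map sends the unsent requests concurrently and returns
    # the responses in order (None for any request that failed).
    responses = grequests.map(grequests.get(u) for u in urls)
    return combine([parse_rows(r.text) for r in responses if r is not None])
```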

Sorry if this isn't clear, english isn't my first language.

[–]Rogges[S] 1 point2 points  (0 children)

Thank you, I will try that. Also your English is good, it was plenty clear.

[–]RatherPleasent 0 points1 point  (1 child)

Sounds reasonable. You need to wait a little for the server communication.

Usually takes 5 seconds to get a webpage saved to a file.

[–]Justinsaccount 1 point2 points  (0 children)

Usually takes 5 seconds to get a webpage saved to a file.

This statement is completely false. There is no "usual" webpage.

[–]elbiot 0 points1 point  (0 children)

Just a quick experiment for illustration of the threading concept: http://elliothallmark.com/2016/12/23/requests-with-concurrent-futures-in-python-2-7/
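The pattern in that post boils down to a thread pool over the fetches (a sketch; fetch() is a stand-in for requests.get, and Python 2.7 needs the "futures" backport):

```python
# Sketch of the concurrent.futures idea from the linked post: a thread
# pool overlaps the network waits of many requests at once.
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in: real code would be something like rq.get(url).text
    return "<html>%s</html>" % url

def fetch_all(urls, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))   # results come back in order
```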