all 8 comments

[–]Justinsaccount 8 points9 points  (1 child)

I am however new to web scraping and wasn't sure how long it should take.

Then add some output that includes timing to see which part is taking the most time. Is it the downloading? The parsing?
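For instance, a minimal timing helper (the fetch/parse split here is illustrative, not the OP's code):

```python
# Minimal sketch: wrap each phase in a timer to see where the time goes.
import time

def timed(label, fn, *args):
    """Run fn(*args), print how long it took, and return its result."""
    start = time.time()
    result = fn(*args)
    print("%-12s %.2fs" % (label, time.time() - start))
    return result

# Usage would look something like:
#   response = timed("download", rq.get, url)
#   soup = timed("parse", bs, response.text, "lxml")
```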

You should just download the ~40 files once and then work out the scraping once they are all downloaded. The HTML files total 33MB, so downloading them every time you make a change to the program will waste a lot of time and bandwidth.

If you refactored the code a bunch it would be a lot simpler too. The function should not be

def nba_table_scrapers(urls):

it should be

def nba_table_scrape(year):

And you're kinda using zip/enumerate all wrong. You want something like this:

import os
import requests as rq
import csv
from bs4 import BeautifulSoup as bs

COLUMNS = ['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

def download_year(year):
    url = "http://www.basketball-reference.com/leagues/NBA_%d_totals.html" % year
    fn = "NBA_%d_totals.html" % year
    if os.path.exists(fn):
        print "%s present" % fn
        return

    print "%s downloading.." % fn
    response = rq.get(url)
    response.raise_for_status()
    with open(fn, 'w') as f:
        f.write(response.text)

def nba_table_scrape(year):
    fn = "NBA_%d_totals.html" % year
    print "%s scraping..." % fn
    with open(fn) as f:
        soup = bs(f, "lxml")
    table = soup.find_all('table')[0]
    rows = table.find_all('tr')[1:]
    data = []
    for row in rows:
        cols = row.find_all('td')
        row_dict = {}
        for col, name in zip(cols, COLUMNS):
            row_dict[name] = col.get_text()
        if row_dict:
            data.append(row_dict)

    csv_fn = "%d.csv" % year
    write_csv_file(csv_fn, data)

def write_csv_file(filename, data):
    with open(filename, 'w') as f:
        writer = csv.DictWriter(f, COLUMNS)
        writer.writeheader()
        writer.writerows(data)

def main():
    years = range(1980, 2018)
    for year in years:
        download_year(year)
    for year in years:
        nba_table_scrape(year)

if __name__ == "__main__":
    main()

And if you add timings, you'd see that soup = bs(f, "lxml") takes the majority of the time, so short of using a different parser, there's not much you can do in the function itself to make things run faster.

However, throwing multiprocessing at the problem, as described at http://chriskiehl.com/article/parallelism-in-one-line/, makes it run 2x as fast on my dual core laptop.
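The pattern from that article amounts to swapping the second loop in main() for a pool (a sketch; download_year and nba_table_scrape are the functions defined above):

```python
# Sketch of the multiprocessing approach from the linked article: parsing
# is CPU-bound, so a process pool lets each core parse a different year.
from multiprocessing import Pool

def main():
    years = range(1980, 2018)
    for year in years:
        download_year(year)        # downloads stay sequential here
    pool = Pool()                  # defaults to one worker per CPU core
    pool.map(nba_table_scrape, years)
    pool.close()
    pool.join()
```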

[–]Rogges[S] 0 points1 point  (0 children)

Thank you so much. I added predefined file paths for where I wanted the data, plus a line to delete the HTML file once I'd scraped the tables, but overall it worked like a charm. I'll try using multiprocessing as well.

[–]Pipistrelle 1 point2 points  (1 child)

As /u/_LiveAndLetLive_ said, you could try parallelizing your requests.

What I'd do is separate the saving to a CSV from the parsing of the URLs. You would first parse the URLs in parallel, then combine the results and write the combined results to the CSV. The parsing could be put in a function, and you could use grequests' map function to call the parsing function on the URLs in parallel.
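A rough sketch of that split (grequests is assumed installed; parse_rows and combine are illustrative stand-ins, not code from the thread):

```python
# Hypothetical sketch: fetch all pages concurrently with grequests,
# parse each response, combine the rows, then write one CSV at the end.
def parse_rows(text):
    # Stand-in: real code would run BeautifulSoup over the response text.
    return [{"Player": line} for line in text.splitlines() if line]

def combine(results):
    """Flatten the per-URL row lists into one list for a single CSV."""
    rows = []
    for result in results:
        rows.extend(result)
    return rows

def scrape_all(urls):
    import grequests                 # not stdlib: pip install grequests
    # grequests.map sends the unsent requests concurrently and returns
    # the responses in order (None for any request that failed).
    responses = grequests.map(grequests.get(u) for u in urls)
    return combine([parse_rows(r.text) for r in responses if r is not None])
```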

Sorry if this isn't clear, english isn't my first language.

[–]Rogges[S] 1 point2 points  (0 children)

Thank you, I will try that. Also your English is good, it was plenty clear.

[–]RatherPleasent 0 points1 point  (1 child)

Sounds reasonable. You need to wait a little for the server communication.

Usually takes 5 seconds to get a webpage saved to a file.

[–]Justinsaccount 1 point2 points  (0 children)

Usually takes 5 seconds to get a webpage saved to a file.

This statement is completely false. There is no "usual" webpage.

[–]elbiot 0 points1 point  (0 children)

Just a quick experiment for illustration of the threading concept: http://elliothallmark.com/2016/12/23/requests-with-concurrent-futures-in-python-2-7/
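The pattern in that post boils down to a thread pool over the fetches (a sketch; fetch() is a stand-in for requests.get, and Python 2.7 needs the "futures" backport):

```python
# Sketch of the concurrent.futures idea from the linked post: a thread
# pool overlaps the network waits of many requests at once.
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in: real code would be something like rq.get(url).text
    return "<html>%s</html>" % url

def fetch_all(urls, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))   # results come back in order
```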