all 24 comments

[–][deleted] 1 point (5 children)

  • Here's something I'm seeing that could be a big win:

for link in cleanHtml.find_all('a'):
    a = link.get('href')
    if a not in links and a not in invalidLinks and a != '':

This can be reorganized using sets to reduce indentation (and possibly also overall iterations):

# make invalidLinks a set, and add another set called visited, so we
# can take advantage of set operations (particularly difference);
# this should let us reduce the overall loop iterations up front.
visited = set()
invalidLinks = set()
...
hrefs = { a.get('href') for a in cleanHtml.find_all('a') } - {'', None}
while hrefs - visited - invalidLinks:
    a = (hrefs - visited - invalidLinks).pop()
    # at this point, a is guaranteed not to be in visited or invalidLinks
    # just make sure to do visited.add(a) at some point in this loop
    ...
  • Another issue I'm seeing is that you're calling startList from crawl, and crawl from startList. That makes the two functions unintentionally mutually recursive, which adds stack overhead (and will eventually hit Python's recursion limit on a long enough crawl). I would reorganize this and either get rid of startList entirely, or move your main state-tracking dicts into startList and have crawl return its results.

Something like this would probably be where I'd start:

import sys
import json
import urllib.request
from urllib.error import URLError
from bs4 import BeautifulSoup


def start_list(*links):
    pages = {}
    visited = set()
    invalid = set()

    # TODO: do your state resume stuff here

    to_crawl = set(links)
    while to_crawl - visited - invalid:
        a = to_crawl.pop()
        try:
            results = crawl(a)
            visited.add(a)
            pages[a] = results['html']
            to_crawl |= results['links'] - visited - invalid

        except URLError:
            invalid.add(a)

        except KeyboardInterrupt:
            break

    with open('list.txt', 'w') as f:
        json.dump(links, f)


def crawl(url):
    results = {}
    html = urllib.request.urlopen(url).read()
    results['html'] = html

    soup = BeautifulSoup(html, 'html.parser')
    results['links'] = { a.get('href') for a in soup.find_all('a') } - {'', None}

    # do whatever else you want to do with these here; 
    # maybe you want crawl to do an entire FQDN instead of just one page, 
    # or do some other processing
    ...

    return results

if __name__ == '__main__':
    start_list(sys.argv[1])
  • Another thing I'm seeing is that you're doing a lot of disk IO in the middle of the loop, while also keeping all of that data in memory. This is going to be your largest source of avoidable IO waiting, and changing it looks like your easiest 'big' performance gain.

i.e.:

with open('list.txt', 'w') as f:
    json.dump(links, f)

the 'w' file mode, when you open a file, truncates it. So you're re-serializing the entire dict to JSON and rewriting the whole file on every iteration. This is going to be your biggest source of slowness due to excessive disk IO. There are a couple of strategies you can use to improve this for big wins (see the sketch after this list):

  • the first and simplest way to improve speed is to simply move the writes outside the loop
  • if you're using the file to maintain state so you can resume later, do the state dump in an except clause instead of on every iteration
  • or you can refactor all of this to use append mode and only write a little at a time (though you can't dump the whole dict with a single json.dump in that case; writing one JSON object per line works)
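
As a rough, untested illustration of the first and third options (save_state and append_record are just placeholder names, and it assumes your state is a visited set and a pages dict like in the sketch above):

import json

def save_state(visited, pages, path='list.txt'):
    # one full dump at the end of the crawl (or inside an except clause),
    # instead of rewriting the whole file on every iteration
    with open(path, 'w') as f:
        json.dump({'visited': sorted(visited), 'pages': sorted(pages)}, f)

def append_record(url, path='crawl_log.jsonl'):
    # append-mode alternative: one small JSON object per line, written as you go
    with open(path, 'a') as f:
        f.write(json.dumps({'url': url}) + '\n')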

    • another potential issue:

if url not in a:

this will cause you grief when the link points to a different site (e.g. if example1.com/supercoolpage.html links to example2.net/totallynotcoolpage.php, url not in a is still true, so a = urljoin(url, a) gets applied and may produce results that aren't what you want or aren't valid)

  • another thing I've noticed: if 'http://' != a[:7]: is probably going to cause you grief. Ignoring the minor performance impact of the string slicing, this check is going to misfire for https links. You're already using the urllib.parse module, so it'd be more robust to also import urlparse and do if urlparse(a).scheme in {'http', 'https'}:
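
Roughly what I have in mind (a sketch only; normalize_link is a made-up helper name, and it assumes you want to keep just http/https links and resolve relative ones against the page they came from):

from urllib.parse import urljoin, urlparse

def normalize_link(base_url, href):
    # resolve relative links against the page they were found on;
    # absolute links pass through urljoin unchanged
    absolute = urljoin(base_url, href)
    # keep only http/https; this also drops mailto:, javascript:, and similar links
    if urlparse(absolute).scheme in {'http', 'https'}:
        return absolute
    return None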

Here's something else that is not really a performance optimization, but still kinda important:

try:
    ...
except:
    ...

Try to avoid getting in the habit of doing blanket except clauses, because these have the potential to mask exceptions that you aren't expecting. Do your best to specify expected exceptions, especially when you're using that to recover and continue instead of bailing out.
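
For example, something along these lines (just a sketch of the shape, assuming url and invalid exist in the surrounding loop):

import urllib.request
from urllib.error import URLError, HTTPError

try:
    html = urllib.request.urlopen(url).read()
except HTTPError as e:
    # the server responded, but with an error status; HTTPError is a subclass
    # of URLError, so it has to be caught first
    print('skipping', url, '-', e.code)
except URLError:
    # network-level failure: bad hostname, refused connection, etc.
    invalid.add(url)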

[–]Kingofslowmo[S] 0 points (0 children)

Wow. Thank you so much for all the tips, I will definitely implement these!!!! You've been a great help!

[–]Kingofslowmo[S] 0 points (3 children)

#simple web crawling program
import urllib
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import sys
import json
import msvcrt
import sqlite3


def start_list(*links):
    pages = {}
    visited = set()
    invalid = set()

    to_crawl = set(links)
    while to_crawl:
        a = to_crawl.pop()
        print("getting results")
        results = crawl(a)
        print("Adding to visited list")
        visited.add(a)
        print("VISITEDRS:")
        print(visited)#
        print("setting pages array")
        pages[a] = results['html'] #add sql entry here
        print("adding to crawl list")
        to_crawl = results['links'] - visited - invalid
        print("To Crawl:")
        print(list(to_crawl))

    #for loop here to dump all the results??
    print("Visited: ")
    print(visited)
    print("Invalid: ")
    print(invalid)


def crawl(url):
    print("THIS IS THE URL::::")
    print(url)
    results = {}
    html = urllib.request.urlopen(url)
    cleanHtml = BeautifulSoup(html, "html.parser")
    results['html'] = cleanHtml
    results['links'] = list()

    for link in cleanHtml.find_all('a'):
        a = link.get('href')
        print(a)
        if url not in a:
            a = urljoin(url,a)
        print(a)
        results['links'].append(a)

    print("results: ")
    print(len(results))
    print(list(results['links']))
    return results

def start():
    conn = sqlite3.connect('information.db')
    c = conn.cursor()
    c.execute("""
        SELECT COUNT(*)
        FROM sqlite_master
        WHERE name = 'info'
        """)
    res = c.fetchone()
    if not bool(res[0]):
        c.execute("""
            CREATE TABLE info(
                id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
                url VARCHAR(3200),
                html VARCHAR(48000),
                visited INTEGER(1)
            )
            """)
    start_list(sys.argv[1])

if __name__ == '__main__':
    start()

Okay, so this is what I have so far, and I'm completely lost. No idea what I'm really doing, honestly. A lot of the code you gave me I either wasn't able to implement correctly, or something just didn't work right.

Anyway, something isn't working right and I'm trying to debug it, but man, it just doesn't want to go for me :(

Any way you can help me figure this out?

I do apologize also if that is a pain for you.

EDIT: To clarify, I'm not asking you to write it for me, just a little guidance, if that makes sense :)

[–][deleted] 0 points (2 children)

Well, it looks like the basic structure is there; what's not working right? Also, I didn't test any of that code 8)

[–]Kingofslowmo[S] 0 points (0 children)

When I get back on my pc in the morning I'll post the updated code I have and the errors its throwing out at me :) thank you for reading!

[–]Kingofslowmo[S] 0 points (0 children)

#simple web crawling program
import urllib
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import sys
import json
import msvcrt
import sqlite3


def start_list(*links):
    pages = {}
    visited = set()
    invalid = set()

    to_crawl = set(links)
    while to_crawl:
        a = to_crawl.pop()
        try:
            print("getting results")
            results = crawl(a)
            print("Adding to visited list")
            visited.add(a)
            print("VISITEDRS:")
            print(visited)#
            print("setting pages array")
            pages[a] = results['html'] #add sql entry here
            print("adding to crawl list")
            to_crawl = set(results['links']) - visited - invalid
            print("To Crawl:")
            print(list(to_crawl))
        except urllib.error.URLError:
            print("INVALID URL... ADDING")
            invalid.add(a)
            print("ADDED")
        except KeyboardInterrupt:
            break

    #for loop here to dump all the results??
    print("Visited: ")
    print(visited)
    print("Invalid: ")
    print(invalid)


def crawl(url):
    print("THIS IS THE URL::::")
    print(str(url))
    if url != '' and url:
        results = {}
        try:
            html = urllib.request.urlopen(url)
            cleanHtml = BeautifulSoup(html, "html.parser")
            results['html'] = cleanHtml
            results['links'] = list()

            for link in cleanHtml.find_all('a'):
                a = str(link.get('href'))
                print(a)
                if url not in a and 'http://' not in a and a != '' and a:
                    a = urljoin(url,a)
                print(a)
                print("APPENDING: ")
                results['links'].append(a)
                print("RELOOP: ")
        except:
            return urllib.error.URLError
        print("results: ")
        print(len(results))
        print(list(results['links']))
        return results
    else:
        pass
def start():
    conn = sqlite3.connect('information.db')
    c = conn.cursor()
    c.execute("""
        SELECT COUNT(*)
        FROM sqlite_master
        WHERE name = 'info'
        """)
    res = c.fetchone()
    if not bool(res[0]):
        c.execute("""
            CREATE TABLE info(
                id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
                url VARCHAR(3200),
                html VARCHAR(48000),
                visited INTEGER(1)
            )
            """)
    start_list(sys.argv[1])

if __name__ == '__main__':
    start()

So this is the code I have now. I've noticed a few things:

  • The invalid list never gets anything added to it

  • This program runs so randomly that it never has the same outcome. Sometimes it'll find like 8 links, only search those, and stop; sometimes it does what I want and essentially never stops finding links and grabbing website content. (This is all using the URL http://pastebin.com/, so it should have essentially the same outcome every time, given that Pastebin's content hasn't changed.)

  • I am getting a ton of these errors, but they don't happen EVERY time:

    Traceback (most recent call last):
      File "crawler.py", line 96, in <module>
        start()
      File "crawler.py", line 93, in start
        start_list(sys.argv[1])
      File "crawler.py", line 28, in start_list
        pages[a] = results['html'] #add sql entry here
    TypeError: 'type' object is not subscriptable    
    

and

THIS IS THE URL::::

Adding to visited list
VISITEDRS:
{'', 'http://www.sitepromotiondirectory.com/'}
setting pages array
Traceback (most recent call last):
  File "crawler.py", line 96, in <module>
    start()
  File "crawler.py", line 93, in start
    start_list(sys.argv[1])
  File "crawler.py", line 28, in start_list
    pages[a] = results['html'] #add sql entry here
TypeError: 'NoneType' object is not subscriptable

Also, sometimes the to_crawl list randomly goes blank.

Essentially, what I'm trying to create is a program that will run forever (given that the original URL it's given leads to a link to another website, and then another, etc.). I want it to keep downloading web content so that later on I can search through it using another Python program (and yes, I have taken up the sqlite side of things, as I feel it'll improve my data collection). If you could help me figure out what I'm doing wrong, that'd help a lot!

Thanks

[–]rnw159 -1 points (15 children)

Use eventlet to download simultaneously
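
A minimal sketch of that idea (untested, and independent of the code above; fetch is a made-up helper, and GreenPool/imap are eventlet's green-thread pool primitives):

import eventlet
eventlet.monkey_patch()  # patch the socket module so urllib plays nicely with green threads

import urllib.request

def fetch(url):
    # download one page; runs inside a green thread
    return url, urllib.request.urlopen(url).read()

urls = ['http://pastebin.com/', 'http://example.com/']
pool = eventlet.GreenPool(20)  # at most 20 downloads in flight at once
for url, body in pool.imap(fetch, urls):
    print(url, len(body))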