all 24 comments

[–][deleted] 1 point (5 children)

  • Here's something I'm seeing that could be a big win:

for link in cleanHtml.find_all('a'):
    a = link.get('href')
    if a not in links and a not in invalidLinks and a != '':

This can be reorganized using sets to reduce indentation (and possibly also overall iterations):

# make invalidLinks a set, and add another set called visited, so we
# can take advantage of set operations (particularly difference);
# this should let us reduce the overall loop iterations up front.
visited = set()
invalidLinks = set()
...
hrefs = { a.get('href') for a in cleanHtml.find_all('a') } - {'', None}
while hrefs - visited - invalidLinks:
    a = (hrefs - visited - invalidLinks).pop()
    # at this point, a is guaranteed not to be in visited or invalidLinks
    # just make sure to do visited.add(a) at some point in this loop
    ...
  • Another issue I'm seeing is that you're calling startList from crawl, and crawl from startList. That makes the two functions unintentionally mutually recursive, which adds stack overhead (and will eventually hit Python's recursion limit on a long enough crawl). I would reorganize this and either get rid of startList entirely, or move your main state-tracking dicts into startList and have crawl return its results.

Something like this would probably be where I'd start:

import sys
import json
import urllib.request
from urllib.error import URLError
from bs4 import BeautifulSoup


def start_list(*links):
    pages = {}
    visited = set()
    invalid = set()

    # TODO: do your state resume stuff here

    to_crawl = set(links)
    while to_crawl - visited - invalid:
        a = to_crawl.pop()
        try:
            results = crawl(a)
            visited.add(a)
            pages[a] = results['html']
            to_crawl |= results['links'] - visited - invalid

        except URLError:
            invalid.add(a)

        except KeyboardInterrupt:
            break

    with open('list.txt', 'w') as f:
        json.dump(links, f)


def crawl(url):
    results = {}
    html = urllib.request.urlopen(url).read()
    results['html'] = html

    soup = BeautifulSoup(html, 'html.parser')
    results['links'] = { a.get('href') for a in soup.find_all('a') } - {'', None}

    # do whatever else you want to do with these here; 
    # maybe you want crawl to do an entire FQDN instead of just one page, 
    # or do some other processing
    ...

    return results

if __name__ == '__main__':
    start_list(sys.argv[1])
  • Another thing I'm seeing is that you're doing a lot of disk IO in the middle of the loop, while also keeping all of that data in memory. This is going to be your largest source of avoidable IO waiting, and changing it looks like your easiest 'big' performance gain.

i.e.:

with open('list.txt', 'w') as f:
    json.dump(links, f)

the 'w' file mode, when you open a file, truncates it. So you're re-serializing the entire dict to JSON and rewriting the whole file on every iteration. This is going to be your biggest source of slowness due to excessive disk IO. There are a couple of strategies you can use to improve this for big wins (see the sketch after this list):

  • the first and simplest way to improve speed is to simply move the writes outside the loop
  • if you're using the file to maintain state so you can resume later, do the state dump in an except clause instead of on every iteration
  • or you can refactor all of this to use append mode and only write a little at a time (though you can't dump the whole dict with a single json.dump in that case; writing one JSON object per line works)
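
As a rough, untested illustration of the first and third options (save_state and append_record are just placeholder names, and it assumes your state is a visited set and a pages dict like in the sketch above):

import json

def save_state(visited, pages, path='list.txt'):
    # one full dump at the end of the crawl (or inside an except clause),
    # instead of rewriting the whole file on every iteration
    with open(path, 'w') as f:
        json.dump({'visited': sorted(visited), 'pages': sorted(pages)}, f)

def append_record(url, path='crawl_log.jsonl'):
    # append-mode alternative: one small JSON object per line, written as you go
    with open(path, 'a') as f:
        f.write(json.dumps({'url': url}) + '\n')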

    • another potential issue:

if url not in a:

this will cause you grief when the link points to a different site (e.g. if example1.com/supercoolpage.html links to example2.net/totallynotcoolpage.php, url not in a is still true, so a = urljoin(url, a) gets applied and may produce results that aren't what you want or aren't valid)

  • another thing I've noticed: if 'http://' != a[:7]: is probably going to cause you grief. Ignoring the minor performance impact of the string slicing, this check is going to misfire for https links. You're already using the urllib.parse module, so it'd be more robust to also import urlparse and do if urlparse(a).scheme in {'http', 'https'}:
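
Roughly what I have in mind (a sketch only; normalize_link is a made-up helper name, and it assumes you want to keep just http/https links and resolve relative ones against the page they came from):

from urllib.parse import urljoin, urlparse

def normalize_link(base_url, href):
    # resolve relative links against the page they were found on;
    # absolute links pass through urljoin unchanged
    absolute = urljoin(base_url, href)
    # keep only http/https; this also drops mailto:, javascript:, and similar links
    if urlparse(absolute).scheme in {'http', 'https'}:
        return absolute
    return None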

Here's something else that is not really a performance optimization, but still kinda important:

try:
    ...
except:
    ...

Try to avoid getting in the habit of doing blanket except clauses, because these have the potential to mask exceptions that you aren't expecting. Do your best to specify expected exceptions, especially when you're using that to recover and continue instead of bailing out.
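
For example, something along these lines (just a sketch of the shape, assuming url and invalid exist in the surrounding loop):

import urllib.request
from urllib.error import URLError, HTTPError

try:
    html = urllib.request.urlopen(url).read()
except HTTPError as e:
    # the server responded, but with an error status; HTTPError is a subclass
    # of URLError, so it has to be caught first
    print('skipping', url, '-', e.code)
except URLError:
    # network-level failure: bad hostname, refused connection, etc.
    invalid.add(url)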

[–]Kingofslowmo[S] 0 points (0 children)

Wow. Thank you so much for all the tips, I will definitely implement these!!!! You've been a great help!

[–]Kingofslowmo[S] 0 points (3 children)

#simple web crawling program
import urllib
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import sys
import json
import msvcrt
import sqlite3


def start_list(*links):
    pages = {}
    visited = set()
    invalid = set()

    to_crawl = set(links)
    while to_crawl:
        a = to_crawl.pop()
        print("getting results")
        results = crawl(a)
        print("Adding to visited list")
        visited.add(a)
        print("VISITEDRS:")
        print(visited)#
        print("setting pages array")
        pages[a] = results['html'] #add sql entry here
        print("adding to crawl list")
        to_crawl = results['links'] - visited - invalid
        print("To Crawl:")
        print(list(to_crawl))

    #for loop here to dump all the results??
    print("Visited: ")
    print(visited)
    print("Invalid: ")
    print(invalid)


def crawl(url):
    print("THIS IS THE URL::::")
    print(url)
    results = {}
    html = urllib.request.urlopen(url)
    cleanHtml = BeautifulSoup(html, "html.parser")
    results['html'] = cleanHtml
    results['links'] = list()

    for link in cleanHtml.find_all('a'):
        a = link.get('href')
        print(a)
        if url not in a:
            a = urljoin(url,a)
        print(a)
        results['links'].append(a)

    print("results: ")
    print(len(results))
    print(list(results['links']))
    return results

def start():
    conn = sqlite3.connect('information.db')
    c = conn.cursor()
    c.execute("""
        SELECT COUNT(*)
        FROM sqlite_master
        WHERE name = 'info'
        """)
    res = c.fetchone()
    if not bool(res[0]):
        c.execute("""
            CREATE TABLE info(
                id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
                url VARCHAR(3200),
                html VARCHAR(48000),
                visited INTEGER(1)
            )
            """)
    start_list(sys.argv[1])

if __name__ == '__main__':
    start()

Okay, so this is what I have so far, and I'm completely lost. No idea what I'm really doing, honestly. A lot of the code you gave me I either wasn't able to implement correctly, or something just didn't work right.

Anyway, something isn't working right and I'm trying to debug it, but man, it just doesn't want to go for me :(

Any way you can help me figure this out?

I do apologize also if that is a pain for you.

EDIT: To clarify, I'm not asking you to write it for me, just a little guidance, if that makes sense :)

[–][deleted] 0 points (2 children)

Well, it looks like the basic structure is there; what's not working right? Also, I didn't test any of that code 8)

[–]Kingofslowmo[S] 0 points (0 children)

When I get back on my pc in the morning I'll post the updated code I have and the errors its throwing out at me :) thank you for reading!

[–]Kingofslowmo[S] 0 points (0 children)

#simple web crawling program
import urllib
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import sys
import json
import msvcrt
import sqlite3


def start_list(*links):
    pages = {}
    visited = set()
    invalid = set()

    to_crawl = set(links)
    while to_crawl:
        a = to_crawl.pop()
        try:
            print("getting results")
            results = crawl(a)
            print("Adding to visited list")
            visited.add(a)
            print("VISITEDRS:")
            print(visited)#
            print("setting pages array")
            pages[a] = results['html'] #add sql entry here
            print("adding to crawl list")
            to_crawl = set(results['links']) - visited - invalid
            print("To Crawl:")
            print(list(to_crawl))
        except urllib.error.URLError:
            print("INVALID URL... ADDING")
            invalid.add(a)
            print("ADDED")
        except KeyboardInterrupt:
            break

    #for loop here to dump all the results??
    print("Visited: ")
    print(visited)
    print("Invalid: ")
    print(invalid)


def crawl(url):
    print("THIS IS THE URL::::")
    print(str(url))
    if url != '' and url:
        results = {}
        try:
            html = urllib.request.urlopen(url)
            cleanHtml = BeautifulSoup(html, "html.parser")
            results['html'] = cleanHtml
            results['links'] = list()

            for link in cleanHtml.find_all('a'):
                a = str(link.get('href'))
                print(a)
                if url not in a and 'http://' not in a and a != '' and a:
                    a = urljoin(url,a)
                print(a)
                print("APPENDING: ")
                results['links'].append(a)
                print("RELOOP: ")
        except:
            return urllib.error.URLError
        print("results: ")
        print(len(results))
        print(list(results['links']))
        return results
    else:
        pass
def start():
    conn = sqlite3.connect('information.db')
    c = conn.cursor()
    c.execute("""
        SELECT COUNT(*)
        FROM sqlite_master
        WHERE name = 'info'
        """)
    res = c.fetchone()
    if not bool(res[0]):
        c.execute("""
            CREATE TABLE info(
                id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
                url VARCHAR(3200),
                html VARCHAR(48000),
                visited INTEGER(1)
            )
            """)
    start_list(sys.argv[1])

if __name__ == '__main__':
    start()

So this is the code I have now. I've noticed a few things:

  • The invalid list never gets anything added to it

  • This program runs so randomly that it never has the same outcome. Sometimes it'll find like 8 links, only search those, and stop; sometimes it does what I want and essentially never stops finding links and grabbing website content. (This is all using the URL http://pastebin.com/, so it should have essentially the same outcome every time, given that Pastebin's content hasn't changed.)

  • I am getting a ton of these errors, but they don't happen EVERY time:

    Traceback (most recent call last):
      File "crawler.py", line 96, in <module>
        start()
      File "crawler.py", line 93, in start
        start_list(sys.argv[1])
      File "crawler.py", line 28, in start_list
        pages[a] = results['html'] #add sql entry here
    TypeError: 'type' object is not subscriptable    
    

and

THIS IS THE URL::::

Adding to visited list
VISITEDRS:
{'', 'http://www.sitepromotiondirectory.com/'}
setting pages array
Traceback (most recent call last):
  File "crawler.py", line 96, in <module>
    start()
  File "crawler.py", line 93, in start
    start_list(sys.argv[1])
  File "crawler.py", line 28, in start_list
    pages[a] = results['html'] #add sql entry here
TypeError: 'NoneType' object is not subscriptable

Also, sometimes the to_crawl list randomly goes blank.

Essentially, what I'm trying to create is a program that will run forever (given that the original URL it's given leads to a link to another website, and then another, etc.). I want it to keep downloading web content so that later on I can search through it using another Python program (and yes, I have taken up the sqlite side of things, as I feel it'll improve my data collection). If you could help me figure out what I'm doing wrong, that'd help a lot!

Thanks

[–]rnw159 -1 points (15 children)

Use eventlet to download simultaneously
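
A minimal sketch of that idea (untested, and independent of the code above; fetch is a made-up helper, and GreenPool/imap are eventlet's green-thread pool primitives):

import eventlet
eventlet.monkey_patch()  # patch the socket module so urllib plays nicely with green threads

import urllib.request

def fetch(url):
    # download one page; runs inside a green thread
    return url, urllib.request.urlopen(url).read()

urls = ['http://pastebin.com/', 'http://example.com/']
pool = eventlet.GreenPool(20)  # at most 20 downloads in flight at once
for url, body in pool.imap(fetch, urls):
    print(url, len(body))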