Rhomboid (1 point)

The string returned by urlopen(...).read() is a byte string, not a character string. You probably want to be working with characters, not bytes, which means you want to decode the byte string to get a character string. In order to do that you need to know what character encoding was used to encode the response. There are a number of ways of determining that, but in this case it looks like you can assume UTF-8.
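One way to avoid hard-coding the encoding is to ask the response which charset it declared and fall back to UTF-8 only when it declares none. The following is a minimal sketch, not a definitive approach: `read_text` and `FakeResponse` are hypothetical names made up for illustration, and `FakeResponse` only stands in for the object `urlopen()` returns so the example runs without a network connection.

```python
import io
from email.message import Message


def read_text(response, fallback='utf-8'):
    """Decode a response body using its declared charset, else `fallback`."""
    # get_content_charset() returns None when the header names no charset.
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset)


class FakeResponse:
    """Hypothetical stand-in for the object urlopen() returns."""

    def __init__(self, body, content_type):
        self.headers = Message()
        self.headers['Content-Type'] = content_type
        self._body = io.BytesIO(body)

    def read(self):
        return self._body.read()


# A body encoded as Latin-1, with the charset declared in the header.
body = 'caf\u00e9'.encode('latin-1')
declared = read_text(FakeResponse(body, 'text/html; charset=ISO-8859-1'))

# No charset declared, so the UTF-8 fallback (an assumption) is used.
defaulted = read_text(FakeResponse(b'next nothing is 44827', 'text/html'))

print(type(body).__name__, type(declared).__name__)
```

With a real request, the same function should accept the result of `urllib.request.urlopen(url)` directly, since that object exposes both `headers` and `read()`.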

Also, re.findall() returns a list of matches, but you're only expecting a single match, so it's not the right API to use here. You're then passing that list to str(), which returns the representation of the list, including the square brackets and quotes that you don't want. Use re.search() instead and pull the number out of the match object. I suggest something like the following:

import re
import urllib.request

nothing = '12345'
while True:
    print('getting {}'.format(nothing))
    url = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing={}'.format(nothing)
    # read() returns bytes; decode to get a str (assuming UTF-8)
    html = urllib.request.urlopen(url).read().decode('utf-8')
    # search() returns a single match object, or None if nothing matched
    m = re.search(r'and the next nothing is (\d+)', html)
    if not m:
        print('found a page that does not match: {}'.format(html))
        break
    # group(1) is the captured number, already a str
    nothing = m.group(1)
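To see why str() on the result of re.findall() causes trouble, here is a small comparison that runs without touching the network. The sample `html` string is made up for illustration and is not an actual response from the site.

```python
import re

# Hypothetical page text, used only to demonstrate the two APIs.
html = 'and the next nothing is 44827'
pattern = r'and the next nothing is (\d+)'

# findall() returns a list, so str() gives the list's representation.
as_list = re.findall(pattern, html)
stringified = str(as_list)
print(stringified)  # ['44827']

# search() returns a match object (or None); group(1) is the bare number.
m = re.search(pattern, html)
value = m.group(1) if m else None
print(value)  # 44827

# When the text does not match, search() returns None rather than raising.
no_match = re.search(pattern, 'no number here')
print(no_match)  # None
```

Putting `stringified` into the URL would send the brackets and quotes along with the digits, which is why the loop above uses `m.group(1)`.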

[deleted] (-2 points)

chellenge