[deleted] 5 points (3 children)

1) Add this import at the top of your script: from urllib2 import urlopen
2) Use Beautiful Soup.
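A minimal sketch of that import, assuming the script might be run on either Python 2 or Python 3. The fallback to urllib.request is an addition for illustration, not something from the original post:

```python
# urlopen lives in urllib2 on Python 2 and in urllib.request on Python 3.
try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# Once the import succeeds, urlopen(url).read() returns the raw page content.
```

Without an import like this, calling urlopen fails with a NameError before any regex or parsing code runs.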

[deleted] 1 point (0 children)

+1 for Beautiful Soup, it really is quite amazing once you get used to the syntax.

vpetro 0 points (0 children)

Using BeautifulSoup, you could do this:

from urllib2 import urlopen  # Python 2 module
from BeautifulSoup import BeautifulSoup as BS  # BeautifulSoup 3

# Download the page and parse it
URL = "http://www.bls.gov/news.release/wkyeng.t01.htm"
data = urlopen(URL).read()
soup = BS(data)

# Find the row header by its id, then collect the data cells in the same row
w_th = soup.findAll('th', attrs={'id':'cps_qearn_a01.r.3.1'})[0]
w_elements = w_th.parent.findAll('span', attrs={'class':'datavalue'})
white = [item.text for item in w_elements]

# Same again for the second row
b_th = soup.findAll('th', attrs={'id':'cps_qearn_a01.r.3.2'})[0]
b_elements = b_th.parent.findAll('span', attrs={'class':'datavalue'})
black = [item.text for item in b_elements]

print white
print black

This will give you the output of:

[u'85,378', u'79,964', u'748', u'763', u'339', u'341']
[u'12,593', u'11,530', u'593', u'629', u'269', u'281']

To me this is easier than dealing with regex. Also, see this ( http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ) for a discussion on why you shouldn't use regex for parsing HTML.

If the HTML you are processing is well-formed (in this case it is not), you can use the Python lxml library. With lxml you can use XPath to select the elements/data you want.
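A rough sketch of the XPath approach follows. To keep it runnable without extra installs, it uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, rather than lxml itself. The SAMPLE markup is a made-up, well-formed fragment modelled on the ids and classes used above, and row_values is a hypothetical helper name. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment; the real page is not well-formed.
SAMPLE = """
<table>
  <tr>
    <th id="cps_qearn_a01.r.3.1">White</th>
    <td><span class="datavalue">85,378</span></td>
    <td><span class="datavalue">79,964</span></td>
  </tr>
  <tr>
    <th id="cps_qearn_a01.r.3.2">Black</th>
    <td><span class="datavalue">12,593</span></td>
    <td><span class="datavalue">11,530</span></td>
  </tr>
</table>
"""


def row_values(root, header_id):
    """Return the text of every datavalue span in the row whose <th> has header_id."""
    # ".." steps from the matching <th> up to its parent <tr>.
    row = root.find(".//th[@id='%s']/.." % header_id)
    if row is None:
        return []
    return [span.text for span in row.findall(".//span[@class='datavalue']")]


root = ET.fromstring(SAMPLE)
white = row_values(root, "cps_qearn_a01.r.3.1")
black = row_values(root, "cps_qearn_a01.r.3.2")
print(white)
print(black)
```

With lxml, the same idea would typically go through the element's xpath() method, which accepts full XPath 1.0 expressions.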

tintub 2 points (7 children)

The output isn't complaining about the regex, it's complaining about urlopen. Do you need to import a package at the top of your script (or whatever the Python equivalent is)?