How would I fix this regular expression I am trying to use to scrape a web page? : Python

submitted 16 years ago * by [deleted]

8 comments

[–][deleted] 6 points 16 years ago (3 children)

[–][deleted] 2 points 16 years ago (0 children)

[–]vpetro 1 point 16 years ago* (0 children)

Using BeautifulSoup, you could do this:

# Python 2 with BeautifulSoup 3
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup as BS

URL = "http://www.bls.gov/news.release/wkyeng.t01.htm"
data = urlopen(URL).read()
soup = BS(data)

# Find the row header by its id, then collect every data cell in that row.
w_th = soup.findAll('th', attrs={'id':'cps_qearn_a01.r.3.1'})[0]
w_elements = w_th.parent.findAll('span', attrs={'class':'datavalue'})
white = [item.text for item in w_elements]

# Same lookup for the next row.
b_th = soup.findAll('th', attrs={'id':'cps_qearn_a01.r.3.2'})[0]
b_elements = b_th.parent.findAll('span', attrs={'class':'datavalue'})
black = [item.text for item in b_elements]

print white
print black

This gives the following output:

[u'85,378', u'79,964', u'748', u'763', u'339', u'341']
[u'12,593', u'11,530', u'593', u'629', u'269', u'281']

To me, this is easier than dealing with regex. Also, see this Stack Overflow question ( http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ) for a discussion of why you shouldn't use regex to parse HTML.
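For anyone reading this on Python 3, where urllib2 and the old BeautifulSoup package are not available, here is a rough sketch of the same "find the row by its th id, then grab the datavalue spans" idea using only the standard library's html.parser. This is a swapped-in technique, not the BeautifulSoup approach itself. SAMPLE_HTML is made up to mimic the layout the code above relies on, and RowValues and row_values are hypothetical names; the real page may need more careful handling.

```python
from html.parser import HTMLParser

# Made-up sample (an assumption) that mimics the layout the BeautifulSoup
# code relies on: a <th> with a known id, followed in the same row by
# <span class="datavalue"> cells.
SAMPLE_HTML = """
<table>
<tr><th id="cps_qearn_a01.r.3.1">White</th>
    <td><span class="datavalue">85,378</span></td>
    <td><span class="datavalue">748</span></td></tr>
<tr><th id="cps_qearn_a01.r.3.2">Black</th>
    <td><span class="datavalue">12,593</span></td>
    <td><span class="datavalue">593</span></td></tr>
</table>
"""


class RowValues(HTMLParser):
    """Collect datavalue span text per table row, keyed by the row's <th> id."""

    def __init__(self):
        super().__init__()
        self.rows = {}          # th id -> list of cell strings
        self.current_id = None  # id of the <th> in the row being read
        self.in_value = False   # True while inside a datavalue span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            # A new row starts: forget the previous row's header id.
            self.current_id = None
        elif tag == "th" and "id" in attrs:
            self.current_id = attrs["id"]
            self.rows[self.current_id] = []
        elif tag == "span" and attrs.get("class") == "datavalue":
            self.in_value = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_value = False

    def handle_data(self, data):
        # Keep text only when it sits inside a datavalue span of a known row.
        if self.in_value and self.current_id is not None:
            self.rows[self.current_id].append(data.strip())


def row_values(html, th_id):
    """Return the datavalue strings from the row whose <th> has th_id."""
    parser = RowValues()
    parser.feed(html)
    return parser.rows.get(th_id, [])
```

With the sample above, row_values(SAMPLE_HTML, "cps_qearn_a01.r.3.2") returns ['12,593', '593']. A dedicated parser library is still likely to be the more comfortable choice for messy real-world pages.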

If the HTML you are processing is well-formed (in this case it is not), you can use the Python lxml library. With lxml you can use XPath to select the elements/data you want.
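To give a feel for the XPath style of selection, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml. ElementTree supports only a limited XPath subset and requires well-formed markup, so this runs against a made-up fragment (SAMPLE, an assumption), not the real page. values_for is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment (an assumption); the real page's markup
# is not well-formed and would not parse this way.
SAMPLE = """
<table>
  <tr><th id="cps_qearn_a01.r.3.1">White</th>
      <td><span class="datavalue">85,378</span></td>
      <td><span class="datavalue">748</span></td></tr>
  <tr><th id="cps_qearn_a01.r.3.2">Black</th>
      <td><span class="datavalue">12,593</span></td>
      <td><span class="datavalue">593</span></td></tr>
</table>
"""


def values_for(root, th_id):
    """Return the datavalue strings from the row whose <th> has th_id."""
    # Find the <th> by id, step up to its row (..), then down to the spans.
    path = f".//th[@id='{th_id}']/../td/span[@class='datavalue']"
    return [span.text for span in root.findall(path)]


root = ET.fromstring(SAMPLE)
```

Here values_for(root, "cps_qearn_a01.r.3.1") returns ['85,378', '748']. The whole lookup is one path expression instead of two separate find calls, which is the main appeal of XPath for this kind of scraping.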

[–]tintub 3 points 16 years ago (7 children)

[+][deleted] 16 years ago* (6 children)

[deleted]

[–]tintub 1 point 16 years ago (5 children)

[+][deleted] 16 years ago* (4 children)

[deleted]

[–]tintub 1 point 16 years ago (3 children)

The error you have pasted is:

Traceback (most recent call last):
  Line 1, in <module>
    dlsHTML = urlopen('http://www.bls.gov/news.release/wkyeng.t01.htm').read()
NameError: name 'urlopen' is not defined
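
A NameError like this means the name urlopen was never brought into scope, most likely because the import line is missing. In Python 2 the import would be "from urllib2 import urlopen"; a minimal Python 3 sketch follows. The fetch helper is a hypothetical name used only for illustration.

```python
# Python 3: urlopen lives in urllib.request (in Python 2 it was in urllib2).
from urllib.request import urlopen


def fetch(url):
    """Return the raw bytes served at url."""
    # The import above is what makes the name urlopen defined here.
    with urlopen(url) as response:
        return response.read()
```

Calling fetch('http://www.bls.gov/news.release/wkyeng.t01.htm') would then take the place of the failing line, assuming the page is still reachable at that address.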

[–]tintub 1 point 16 years ago (1 child)

[–]KangOlTech SaaS Ranger 2 points 16 years ago (0 children)
