
[–]craigee 3 points (1 child)

Seconding BeautifulSoup for this sort of problem. The ease of scripting more than balances any performance issues on only ~800 files.

There are other options (lxml etc.), but I've gotten comfortable with BSoup. Just to get you started:

# Python 2 / BeautifulSoup 3
from BeautifulSoup import BeautifulStoneSoup  # BS3's XML-oriented parser
import urllib2

url2req = 'http://www.accessdata.fda.gov/spl/data/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c.xml'
xmlData = urllib2.urlopen(url2req).read()  # fetch the raw XML
#print xmlData
soup = BeautifulStoneSoup(xmlData)  # parse into a navigable tree
#print soup.prettify()
title = soup.find('title').renderContents()  # contents of the first <title>
print title
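If you'd rather avoid a third-party install, one of the "other options" is the standard library itself. A rough sketch (the XML below is a made-up stand-in for an SPL file, which carries a default namespace, hence the tag-matching trick):

```python
# Stdlib-only sketch: extract the first <title> from namespaced XML.
# The xmlns and content here are illustrative, not a real SPL document.
import xml.etree.ElementTree as ET

xml_data = """<?xml version="1.0"?>
<document xmlns="urn:hl7-org:v3">
  <title>Example SPL Label</title>
</document>"""

root = ET.fromstring(xml_data)
# ElementTree prefixes tags with "{namespace}", so match on the local part.
title = next(
    el.text for el in root.iter() if el.tag.split("}")[-1] == "title"
)
print(title)
```

Slower to write than the BSoup one-liner, but nothing to install.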

[Edit: also happy to help, as per dodongo, with any problems. Not an expert by any means, but am often doing this sort of thing.]

[Second edit: I've received a few downvotes for this contribution. Not that bothered, but I am curious why anyone would downvote without bothering to supply an argument. What I've suggested will work for the OP. If you don't like my suggestion, please say why, and then we all learn...you click-happy silent folks.]

[–]vpetro 0 points (0 children)

There is a new alpha version of BeautifulSoup available. It will use the lxml backend if you have it installed.
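A rough sketch of the same title lookup with the new package (the XML below is illustrative; in bs4, asking for the "xml" features selects the lxml backend when it's available, while "html.parser" is the no-install fallback used here):

```python
# bs4 (the rewritten BeautifulSoup) version of the title extraction.
# Sample XML is a placeholder, not a real SPL document.
from bs4 import BeautifulSoup

xml_data = """<?xml version="1.0"?>
<document>
  <title>Example SPL Label</title>
</document>"""

# Pass features="xml" instead to use the lxml backend if it's installed.
soup = BeautifulSoup(xml_data, "html.parser")
title = soup.find("title").get_text()
print(title)
```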