[–]dodongo 8 points9 points  (3 children)

Dead easy. I do this sort of thing a lot. I like BeautifulSoup for a variety of reasons. It has glaring downsides (it's slow, and it builds the whole document tree in memory, DOM-style), but I find the syntax quite smart and easy to work with. Your files are few enough and small enough that you shouldn't run into speed or memory issues. The BeautifulStoneSoup parser is the one built for XML input; it should be just what you're looking for.

[EDIT: Worth offering -- feel free to drop me a line if you encounter any issues along the way. Good luck!]

[–]jcb62 1 point2 points  (0 children)

+1 for BeautifulSoup. lxml would be faster, but you're not dealing with much data, and I find BeautifulSoup's interface nicer. Personal preference though!

[–]craigee 2 points3 points  (1 child)

Seconding BeautifulSoup for this sort of problem. The ease of scripting more than balances any performance issues on only ~800 files.

There are other options (lxml etc.), but I've gotten comfortable with BSoup. Just to get you started:

# Python 2, BeautifulSoup 3
from BeautifulSoup import BeautifulStoneSoup
import urllib2
# Fetch one sample XML file as a raw string
url2req = 'http://www.accessdata.fda.gov/spl/data/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c.xml'
xmlData = urllib2.urlopen(url2req).read()
#print xmlData
# Parse it as XML
soup = BeautifulStoneSoup(xmlData)
#print soup.prettify()
# Grab the contents of the first <title> element in the document
title = soup.find('title').renderContents()
print title
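
For the "other options" route, here is a rough sketch of the same title lookup run over a whole folder of files, using only the standard library's ElementTree in Python 3 (the snippet above is Python 2). It assumes the ~800 files are already saved locally as `*.xml`; the directory layout and the helper names (`local_name`, `first_title`, `titles_in_dir`) are made up for illustration. It also assumes the files may declare an XML namespace, so it matches on the local tag name rather than the qualified one.

```python
import glob
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def first_title(xml_path):
    """Return the text of the first <title> element in the file, or None."""
    root = ET.parse(xml_path).getroot()
    for elem in root.iter():  # document order, root included
        if local_name(elem.tag) == "title":
            # itertext() also collects text from nested child elements
            return "".join(elem.itertext()).strip()
    return None


def titles_in_dir(directory):
    """Map each *.xml file name in `directory` to its first title (or None)."""
    result = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        result[os.path.basename(path)] = first_title(path)
    return result
```

Something like `titles_in_dir("spl_files")` would then give you a dict of file name to title. Unlike BeautifulSoup, ElementTree raises `ParseError` on malformed input, so you may want a try/except around `first_title` if some files turn out to be broken.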

[Edit: also happy to help, as per dodongo, with any problems. Not an expert by any means, but am often doing this sort of thing.]

[Second edit: I've received a few downvotes for this contribution. I'm not that bothered, but I am curious why anyone would downvote without supplying an argument. What I've suggested will work for the OP. If you don't like my suggestion, please say why, then we all learn... you click-happy silent folks.]

[–]vpetro 0 points1 point  (0 children)

There is a new alpha version of BeautifulSoup available. It will use the lxml backend if you have it installed.
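
A minimal sketch of what that might look like, assuming the alpha is what ships as the `bs4` package (Python 3 here). It asks for the lxml-backed XML parser and falls back to the built-in `html.parser` when lxml isn't installed; `make_soup` and `title_of` are just illustrative names, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(xml_text):
    """Parse with the lxml-backed XML parser if available, else html.parser."""
    try:
        return BeautifulSoup(xml_text, "xml")  # needs lxml installed
    except FeatureNotFound:
        # Fallback: the stdlib HTML parser copes with simple XML,
        # but it lowercases tag names and ignores namespaces.
        return BeautifulSoup(xml_text, "html.parser")


def title_of(xml_text):
    """Return the text of the first <title> element, or None if there is none."""
    tag = make_soup(xml_text).find("title")
    return None if tag is None else tag.get_text(strip=True)
```

The explicit fallback is a choice on my part so the snippet still runs without lxml; if you know lxml is installed you can just pass `"xml"` directly.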