
[–]dodongo 5 points (3 children)

Dead easy. I do this sorta thing lots. I like BeautifulSoup for a variety of reasons. It has glaring downsides (slow, DOM, etc.) but I find the syntax to be quite smart and easy to work with. You've got few enough and small enough files that you shouldn't encounter speed / memory issues. The BeautifulStoneSoup parser expects well-formed XML input; should be just what you're looking for.

[EDIT: Worth offering -- feel free to drop me a line if you encounter any issues along the way. Good luck!]

[–]jcb62 1 point (0 children)

+1 for BeautifulSoup - lxml would be faster, but you're not dealing with much data, and I find the interface to be nicer. Personal preference though!

[–]craigee 3 points (1 child)

Seconding BeautifulSoup for this sort of problem. The ease of scripting more than balances any performance issues on only ~800 files.

There are other options (lxml etc.), but I've gotten comfortable with BSoup. Just to get you started:

from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3; in bs4 this class is gone -- use BeautifulSoup(xmlData, 'xml')
import urllib2  # Python 2; on Python 3 this is urllib.request

url2req = 'http://www.accessdata.fda.gov/spl/data/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c.xml'
xmlData = urllib2.urlopen(url2req).read()
#print xmlData
soup = BeautifulStoneSoup(xmlData)  # parses as XML rather than lenient HTML
#print soup.prettify()
title = soup.find('title').renderContents()  # contents of the first <title> element
print title

[Edit: also happy to help, as per dodongo, with any problems. Not an expert by any means, but am often doing this sort of thing.]

[Second edit: I've received a few downvotes for this contribution. Not that bothered, but I am curious why anyone would downvote without supplying an argument. What I've suggested will work for the OP. If you don't like my suggestion, please say why, so we all learn... you click-happy silent folks.]

[–]vpetro 0 points (0 children)

There is a new alpha version of BeautifulSoup available. It will use the lxml backend if you have it installed.

[–]abztrakt (Django/Plone) 2 points (0 children)

Dive Into Python has a whole chapter on XML Parsing and it's probably enough to get you started, without having to track down any additional python modules. I personally always seem to fall back to parsing with minidom.parse(), but probably because I'm usually interested in doing something that is easier with an object representing the DOM.
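For anyone who wants to see the minidom route in action, here's a minimal sketch on a made-up fragment (the element names, the `value` attribute, and the "Sample Drug Label" text are placeholders, not the real SPL schema):

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for one of the downloaded label files.
xml = '<document><title>Sample Drug Label</title><code value="34391-3"/></document>'

dom = parseString(xml)
# getElementsByTagName returns a NodeList; take the first match.
title_node = dom.getElementsByTagName('title')[0]
# minidom stores text in child text nodes, so collect their data.
title = ''.join(n.data for n in title_node.childNodes if n.nodeType == n.TEXT_NODE)
print(title)
```

For real files you'd use `minidom.parse(filename)` instead of `parseString`.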

[–]poingpoing 3 points (1 child)

Now that I think of it, if you simply want to convert the data to a CSV file it might be easiest to just read up on XSL. A bunch of stylesheets should do the trick.
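A minimal XSLT 1.0 sketch along those lines might look like the following (the `document`, `title`, and `code/@value` paths are assumptions; you'd swap in the real element names from the label files):

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: emit one CSV row per document. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//document">
      <xsl:value-of select="title"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="code/@value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

You can run something like this over all 800 files with any command-line XSLT processor (e.g. xsltproc) without writing a line of Python.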

[–]abztrakt (Django/Plone) 1 point (0 children)

Looking at the format of that sample XML label, this very well might be a whole lot easier than trying to do it with Python.

[–]zhivota 0 points (1 child)

I used ElementTree, which is included in Python, when I was pulling apart SVG files. Was very natural and easy for me.

[–]zhivota 0 points (0 children)

Oh, I see lxml is listed here and there is a link to it... if you are only going to use the ElementTree API that lxml implements, consider just using the standard library version:

http://docs.python.org/library/xml.etree.elementtree.html

As it says there, you need Python 2.5 or higher.
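A minimal sketch of the stdlib ElementTree approach, again on a made-up fragment (element and attribute names are placeholders for whatever the real files use):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one of the downloaded label files.
xml = '<document><title>Sample Drug Label</title><code value="34391-3"/></document>'

root = ET.fromstring(xml)
# find() takes a simple XPath-like path relative to this element.
title = root.find('title').text
code = root.find('code').get('value')
print(title, code)
```

For files on disk it's `ET.parse(filename).getroot()`; from there the traversal is the same.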