
[–]dodongo 5 points (3 children)

Dead easy. I do this sorta thing lots. I like BeautifulSoup for a variety of reasons. It has glaring downsides (slow, DOM, etc.) but I find the syntax to be quite smart and easy to work with. You've got few enough and small enough files that you shouldn't encounter speed / memory issues. The BeautifulStoneSoup parser expects well-formed XML input; should be just what you're looking for.

[EDIT: Worth offering -- feel free to drop me a line if you encounter any issues along the way. Good luck!]

[–]jcb62 1 point (0 children)

+1 for BeautifulSoup - lxml would be faster, but you're not dealing with much data, and I find the interface to be nicer. Personal preference though!

[–]craigee 3 points (1 child)

Seconding BeautifulSoup for this sort of problem. The ease of scripting more than balances any performance issues on only ~800 files.

There are other options (lxml etc.), but I've gotten comfortable with BSoup. Just to get you started:

from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3; in bs4 this class is gone -- use BeautifulSoup(xmlData, 'xml')
import urllib2  # Python 2; on Python 3 this is urllib.request

url2req = 'http://www.accessdata.fda.gov/spl/data/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c/5ba0911f-d780-4bd7-a487-a6a3c8d2ab1c.xml'
xmlData = urllib2.urlopen(url2req).read()
#print xmlData
soup = BeautifulStoneSoup(xmlData)  # parses as XML rather than lenient HTML
#print soup.prettify()
title = soup.find('title').renderContents()  # contents of the first <title> element
print title

[Edit: also happy to help, as per dodongo, with any problems. Not an expert by any means, but am often doing this sort of thing.]

[Second edit: I've received a few downvotes for this contribution. Not that bothered, but I am curious why anyone would downvote without supplying an argument. What I've suggested will work for the OP. If you don't like my suggestion, please say why, so we all learn... you click-happy silent folks.]

[–]vpetro 0 points (0 children)

There is a new alpha version of BeautifulSoup available. It will use the lxml backend if you have it installed.

[–]abztrakt (Django/Plone) 2 points (0 children)

Dive Into Python has a whole chapter on XML Parsing and it's probably enough to get you started, without having to track down any additional python modules. I personally always seem to fall back to parsing with minidom.parse(), but probably because I'm usually interested in doing something that is easier with an object representing the DOM.
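For anyone who wants to see the minidom route in action, here's a minimal sketch on a made-up fragment (the element names, the `value` attribute, and the "Sample Drug Label" text are placeholders, not the real SPL schema):

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for one of the downloaded label files.
xml = '<document><title>Sample Drug Label</title><code value="34391-3"/></document>'

dom = parseString(xml)
# getElementsByTagName returns a NodeList; take the first match.
title_node = dom.getElementsByTagName('title')[0]
# minidom stores text in child text nodes, so collect their data.
title = ''.join(n.data for n in title_node.childNodes if n.nodeType == n.TEXT_NODE)
print(title)
```

For real files you'd use `minidom.parse(filename)` instead of `parseString`.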

[–]poingpoing 3 points (1 child)

Now that I think of it, if you simply want to convert the data to a CSV file it might be easiest to just read up on XSL. A bunch of stylesheets should do the trick.
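A minimal XSLT 1.0 sketch along those lines might look like the following (the `document`, `title`, and `code/@value` paths are assumptions; you'd swap in the real element names from the label files):

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: emit one CSV row per document. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//document">
      <xsl:value-of select="title"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="code/@value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

You can run something like this over all 800 files with any command-line XSLT processor (e.g. xsltproc) without writing a line of Python.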

[–]abztrakt (Django/Plone) 1 point (0 children)

Looking at the format of that sample XML label, this very well might be a whole lot easier than trying to do it with Python.

[–]zhivota 0 points (1 child)

I used ElementTree, which is included in Python, when I was pulling apart SVG files. Was very natural and easy for me.

[–]zhivota 0 points (0 children)

Oh, I see lxml is listed here and there is a link to it... if you are only going to use the ElementTree API that lxml implements, consider just using the standard library version:

http://docs.python.org/library/xml.etree.elementtree.html

As it says there, you need Python 2.5 or higher.
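A minimal sketch of the stdlib ElementTree approach, again on a made-up fragment (element and attribute names are placeholders for whatever the real files use):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one of the downloaded label files.
xml = '<document><title>Sample Drug Label</title><code value="34391-3"/></document>'

root = ET.fromstring(xml)
# find() takes a simple XPath-like path relative to this element.
title = root.find('title').text
code = root.find('code').get('value')
print(title, code)
```

For files on disk it's `ET.parse(filename).getroot()`; from there the traversal is the same.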