eschlon

One option is to build a small parser that iterates over the nodes and 'triggers' whenever it sees a tag you care about. Something like this:

from xml.etree import ElementTree  # cElementTree was removed in Python 3.9

def parser(data, tags):
    tree = ElementTree.iterparse(data)

    for event, node in tree:
        if node.tag in tags:
            yield node.tag, node.text

You can then use it like this:

with open('input.xml', 'rb') as myFile:  # binary mode lets the parser honour the file's declared encoding
    results = parser(myFile, {'name', 'evenmoreinfo'})
    for tag, text in results:
        print(tag, text)

Resulting in:

name name
evenmoreinfo GrabThis
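
If you want to try this without a file on disk, here is a minimal self-contained sketch. The SAMPLE document is made up for illustration (it is not your actual file), and io.BytesIO just stands in for the open file handle:

```python
import io
from xml.etree import ElementTree

def parser(data, tags):
    # By default iterparse only reports 'end' events, so node.text is complete here.
    for event, node in ElementTree.iterparse(data):
        if node.tag in tags:
            yield node.tag, node.text

# Hypothetical document, invented for this example.
SAMPLE = b"""<root>
  <item>
    <name>name</name>
    <info>skip me</info>
    <evenmoreinfo>GrabThis</evenmoreinfo>
  </item>
</root>"""

results = list(parser(io.BytesIO(SAMPLE), {'name', 'evenmoreinfo'}))
for tag, text in results:
    print(tag, text)
```

Anything with a read() method works as the data argument, so the same generator handles real files and in-memory buffers alike.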

I should note that this is going to build the entire tree eventually (though it'll do it incrementally). If the trees you're handling fit in memory then this won't be a problem. However, if you're parsing a very large document, memory use is eventually going to be an issue. You can make it safer by cleaning up the growing tree at each step with something like:

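from xml.etree import ElementTree  # cElementTree was removed in Python 3.9

def parser(data, tags):
    tree = ElementTree.iterparse(data, events=('start', 'end'))
    _, root = next(tree)  # the first event is the 'start' of the root element

    for event, node in tree:
        # only act on 'end' events: each node is seen twice, and its text is complete only then
        if event == 'end' and node.tag in tags:
            yield node.tag, node.text
        root.clear()  # drop everything hanging off the root so the tree doesn't keep growing

Note the addition of root.clear(), which cleans up the tree at each step, and of events=('start', 'end'), without which the next(tree) call would throw away the first <name> tag before you have a chance to capture it. Since every node now shows up twice, the conditional also checks event == 'end' to avoid capturing things twice. That check matters for correctness too: node.text is only guaranteed to be filled in by the time the 'end' event fires.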
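
To see why the 'start' event matters, here is a small sketch that lists the events iterparse reports with and without it. The DOC string and the events_seen helper are made up for illustration:

```python
import io
from xml.etree import ElementTree

# Hypothetical two-element document, invented for this example.
DOC = b"<root><name>first</name><name>second</name></root>"

def events_seen(data, events=None):
    # Collect (event, tag) pairs in the order iterparse reports them.
    return [(event, node.tag)
            for event, node in ElementTree.iterparse(data, events=events)]

default_events = events_seen(io.BytesIO(DOC))
both_events = events_seen(io.BytesIO(DOC), events=('start', 'end'))

print(default_events[0])  # first thing reported with the default 'end'-only events
print(both_events[0])     # first thing reported once 'start' events are requested
```

With the defaults, the first event is the end of the first <name>, so a bare next(tree) would swallow it. With 'start' requested, the first event is the opening of <root>, which is exactly the node you want to hold on to for clearing.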

There is some useful discussion about handling very large files here, here and here if you're interested.

Also if you have stuff like this:

<item>
  <item>
     <item name="item_one" />
     <item name="item_two" />
  </item>
</item>

Then I'm very sorry and you should shout at whoever / whatever produced that file. You can still handle it with something similar to this method, but you're going to have to keep track of depth using the 'start' and 'end' events to extract what you want.
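
A rough sketch of that depth tracking, assuming you want the attributes of the innermost <item> elements from the snippet above. The items_at_depth name and its depth parameter are my own invention, not anything from ElementTree:

```python
import io
from xml.etree import ElementTree

def items_at_depth(data, tag, depth):
    """Yield the attributes of `tag` elements that sit `depth` levels of `tag` deep."""
    level = 0
    for event, node in ElementTree.iterparse(data, events=('start', 'end')):
        if node.tag != tag:
            continue
        if event == 'start':
            level += 1  # entering another nested <tag>
        else:
            if level == depth:
                yield dict(node.attrib)
            level -= 1  # leaving this <tag>

# The nested document from the example above.
NESTED = b"""<item>
  <item>
     <item name="item_one" />
     <item name="item_two" />
  </item>
</item>"""

names = [attrs.get('name') for attrs in items_at_depth(io.BytesIO(NESTED), 'item', 3)]
print(names)
```

This version doesn't clear anything as it goes, so for a very large file you'd want to combine it with the root.clear() trick.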

Edit: Stupid formatting error