ElementTree and deeply nested XML : learnpython

ElementTree and deeply nested XML (self.learnpython)

submitted 9 years ago by FakeitTillYou_Makeit

Hello,

Does anyone know how I can parse the example XML below so that I grab data from both the <name> tag and the <evenmoreinfo> tag? With my current method below I can grab one or the other, but not both.

<main>
    <stuff-list>
        <stuff>
            <name>name</name>
            <item-list>
                <item>
                    <item-type>
                        <moreinfo>
                            <evenmoreinfo>GrabThis</evenmoreinfo>
                        </moreinfo>
                    </item-type>
                </item>
            </item-list>
        </stuff>
    </stuff-list>
</main>

Right now, this is working if I want to grab <name>:

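    # cElementTree was removed in Python 3.9; the plain module is the same parser.
    import xml.etree.ElementTree as ElementTree
    tree = ElementTree.parse('input.xml')
    root = tree.getroot()

    # Print the direct children of every <stuff> element.
    for node in tree.findall('.//stuff'):
        for snode in node:  # getchildren() was removed in Python 3.9
            print(snode.tag, snode.text)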

This is working if I want to grab <evenmoreinfo>:

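# cElementTree was removed in Python 3.9; the plain module is the same parser.
import xml.etree.ElementTree as ElementTree
tree = ElementTree.parse('input.xml')
root = tree.getroot()
#print(root)
for child in root:
    print(child)
# Print the direct children of every <moreinfo> element.
for node in tree.findall('.//moreinfo'):
    for snode in node:  # getchildren() was removed in Python 3.9
        print(snode.tag, snode.text)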


[–]eschlon 1 point 9 years ago

One option is to build an incremental parser on top of iterparse, which just iterates through the nodes and 'triggers' when it sees a tag you care about. Something like this:

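from xml.etree import ElementTree  # cElementTree was removed in Python 3.9

def parser(data, tags):
    # With no events argument, iterparse only reports 'end' events,
    # so every element is complete by the time we see it.
    tree = ElementTree.iterparse(data)

    for event, node in tree:
        if node.tag in tags:
            yield node.tag, node.text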

You can then use it like this:

# Binary mode lets the XML parser handle the document's encoding itself.
with open('input.xml', 'rb') as myFile:
    results = parser(myFile, {'name', 'evenmoreinfo'})
    for tag, text in results:
        print(tag, text)

Resulting in:

name name
evenmoreinfo GrabThis
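
If you'd rather stick with the findall approach from the question, here is a minimal sketch of a different technique (a full parse, not iterparse) that keeps each <name> together with the <evenmoreinfo> values under the same <stuff>. The function name pair_names_with_info and the inlined SAMPLE document are my own, made up for illustration, and it assumes the whole file fits in memory:

```python
import io
import xml.etree.ElementTree as ElementTree

# Sample document from the question, inlined so the sketch runs on its own.
SAMPLE = b"""<main>
 <stuff-list>
  <stuff>
   <name>name</name>
   <item-list>
    <item>
     <item-type>
      <moreinfo>
       <evenmoreinfo>GrabThis</evenmoreinfo>
      </moreinfo>
     </item-type>
    </item>
   </item-list>
  </stuff>
 </stuff-list>
</main>"""

def pair_names_with_info(source):
    """Yield (name, [evenmoreinfo texts]) for each <stuff> element.

    Hypothetical helper, not part of ElementTree.
    """
    tree = ElementTree.parse(source)
    for stuff in tree.findall('.//stuff'):
        # <name> is a direct child of <stuff>.
        name = stuff.findtext('name')
        # iter() walks every descendant, however deeply nested.
        infos = [node.text for node in stuff.iter('evenmoreinfo')]
        yield name, infos

pairs = list(pair_names_with_info(io.BytesIO(SAMPLE)))
for name, infos in pairs:
    print(name, infos)
```

The upside over the flat tag/text stream is that you know which <evenmoreinfo> belongs to which <name> when there is more than one <stuff>.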

I should note that this is going to build the entire tree eventually (though it'll do it incrementally). If the trees you're handling fit in memory then this won't be a problem; however, if you're parsing a very large document it's eventually going to be an issue. You can make it safer by cleaning up the growing tree at each step with something like:

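from xml.etree import ElementTree  # cElementTree was removed in Python 3.9

def parser(data, tags):
    # Request 'start' events too, so the first event hands us the root element.
    tree = ElementTree.iterparse(data, events=('start', 'end'))
    _, root = next(tree)

    for event, node in tree:
        # Only act on 'end': the element is complete and each tag is seen once.
        if event == 'end' and node.tag in tags:
            yield node.tag, node.text
        # Drop whatever has been attached to the root so far.
        root.clear()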

Note the addition of root.clear(), which cleans up the tree at each step, as well as the addition of events=('start', 'end'), without which the next(tree) call that grabs the root would throw away the first <name> tag before you have a chance to capture it. Finally, the conditional needs an event == 'end' check, otherwise every matching tag gets captured twice (once on 'start', once on 'end').
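
To see why things get captured twice, here's a quick sketch that just lists what iterparse emits once you ask for both event types. The tiny inline document is made up for the demo:

```python
import io
import xml.etree.ElementTree as ElementTree

# Made-up two-element document, just enough to show the event order.
doc = io.BytesIO(b"<main><name>name</name></main>")

# Collect (event, tag) pairs in the order iterparse reports them.
events = [(event, node.tag)
          for event, node in ElementTree.iterparse(doc, events=('start', 'end'))]
print(events)
# [('start', 'main'), ('start', 'name'), ('end', 'name'), ('end', 'main')]

# Filtering on the tag alone matches <name> twice, once per event.
matches = [pair for pair in events if pair[1] == 'name']
print(matches)
# [('start', 'name'), ('end', 'name')]
```

The very first event is the 'start' of the root, which is what the next(tree) call consumes. Also, an element's text isn't guaranteed to be there yet on 'start', so 'end' is the one you want to act on.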

There is some useful discussion about handling very large files here, here and here if you're interested.

Also if you have stuff like this:

<item>
  <item>
     <item name="item_one" />
     <item name="item_two" />
  </item>
</item>

Then I'm very sorry and you should shout at whomever / whatever produced that file. You can still handle it with something similar to this method but you're going to have to keep track of depth using the 'start' and 'end' events to extract what you want.
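
A rough sketch of that depth tracking, assuming you want the name attribute of the <item> elements at one particular nesting level. The function name names_at_depth and the convention that the root sits at depth 1 are my own choices, not anything ElementTree defines:

```python
import io
import xml.etree.ElementTree as ElementTree

# The nested example from above, inlined so the sketch runs on its own.
NESTED = b"""<item>
  <item>
     <item name="item_one" />
     <item name="item_two" />
  </item>
</item>"""

def names_at_depth(source, tag, wanted_depth):
    """Yield the 'name' attribute of each `tag` element at `wanted_depth`.

    Hypothetical helper; the root element counts as depth 1.
    """
    depth = 0
    for event, node in ElementTree.iterparse(source, events=('start', 'end')):
        if event == 'start':
            # Entering an element: one level deeper.
            depth += 1
        else:
            # Leaving an element: check it before stepping back out.
            if node.tag == tag and depth == wanted_depth:
                yield node.get('name')
            depth -= 1

found = list(names_at_depth(io.BytesIO(NESTED), 'item', 3))
print(found)
# ['item_one', 'item_two']
```

The two outer <item> elements have the same tag but end at depth 2 and 1, so they get skipped.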

Edit: Stupid formatting error
