Question 1

Given the simple XML data below:

<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>

How can I traverse the tree, using lxml, and get all paragraphs in the "abstract" element, as well as all keywords in the "keywordSet" element?

The code snippet below returns only the first line of text in each element:

from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for line in root.abstract:
    print(line.para)  # prints only the first paragraph
for word in root.keywordSet:
    print(word.keyword)  # prints only the first keyword in the set

I tried to follow this example, but the code above doesn't work as expected.

On a different tack, it would be better still to read the entire XML tree into a Python dictionary, with each element name as a key and the element's text as the value(s). I found out that something like this might be possible using lxml objectify, but I couldn't figure out how to achieve it.

One really big problem I have been finding when attempting to write XML parsing code in Python is that most of the "examples" provided are just too simple and entirely fictitious to be of much help -- or else they are just the opposite, using too complicated automatically-generated XML data!

Could anybody give me a hint?

Thanks in advance!

EDIT: After posting this question, I found a simple solution here.

So, my updated code becomes:

from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for para in root.abstract.iterchildren():
    print(para)  # now prints the text of all paragraphs
for keyword in root.keywordSet.iterchildren():
    print(keyword)  # now prints all keywords in the set
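
As for the dictionary idea mentioned earlier, here is a rough sketch of one possible approach, not a definitive implementation. The helper name element_to_dict is made up for illustration. It ignores attributes and mixed content, and it wraps every child in a list so that repeated tags such as para are all kept. The fallback to the standard library's ElementTree is an assumption that holds here only because the sketch uses the part of the element API that both libraries share.

```python
from collections import defaultdict

try:
    from lxml import etree  # preferred, matches the rest of this post
except ImportError:
    # Assumption: the stdlib ElementTree is close enough for this sketch.
    import xml.etree.ElementTree as etree

XML_STRING = """<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>"""


def element_to_dict(element):
    """Hypothetical helper: map each child tag to a list of converted children."""
    children = list(element)
    if not children:
        # Leaf element: return its text, or an empty string if there is none.
        return (element.text or "").strip()
    result = defaultdict(list)
    for child in children:
        # Repeated tags (para, keyword) accumulate in the same list.
        result[child.tag].append(element_to_dict(child))
    return dict(result)


root = etree.fromstring(XML_STRING)
book = element_to_dict(root)
print(book["title"])
print(book["abstract"][0]["para"])
print(book["keywordSet"][0]["keyword"])
```

Every value is a list, even for elements that occur once (book["title"] is a one-item list). That is a deliberate simplification so the caller never has to check whether a tag was repeated.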

Answer

This is pretty simple using XPath:

from lxml import etree

tree = etree.parse('data.xml')  # data.xml contains the XML data from the question
paragraphs = tree.xpath('/book/abstract/para/text()')
keywords = tree.xpath('/book/keywordSet/keyword/text()')
print(paragraphs)
print(keywords)

Output:

['First paragraph of the abstract', 'Second paragraph of the abstract']
['First keyword', 'Second keyword', 'Third keyword']

See the XPath Tutorial at W3Schools for details on the XPath syntax.

In particular, the expressions above use:

The / selector, which selects the root node at the start of an expression and the immediate children of the preceding node everywhere else.
The text() node test, which selects the text nodes (the "textual content") of the respective elements.
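
To make the two pieces of syntax concrete, here is a small sketch that parses the question's XML from a string (the XML_STRING name is just for illustration) and compares a few expressions. It assumes lxml is installed.

```python
from lxml import etree

XML_STRING = """<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>"""

root = etree.fromstring(XML_STRING)

# A path starting with / is absolute, so it has to name the root element first.
paragraphs = root.xpath('/book/abstract/para/text()')

# Skipping the root element matches nothing for this document.
no_match = root.xpath('/abstract/para/text()')

# // matches at any depth, so the path to the parent is not needed.
keywords = root.xpath('//keyword/text()')

# Without text(), the result is a list of elements rather than strings.
para_elements = root.xpath('/book/abstract/para')

print(paragraphs)
print(no_match)
print(keywords)
print([el.tag for el in para_elements])
```

The // form is convenient for quick lookups, but the full path is more precise when the same tag name can appear in several places in the tree.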

Here's how it could be done using the Objectify API:

from lxml import objectifyroot = objectify.fromstring(xml_string)paras = [p.text for p in root.abstract.para]
keywords = [k.text for k in root.keywordSet.keyword]print paras
print keywords

~~It seems that root.abstract.para is actually shorthand for root.abstract.para[0]. So you need to explicitly use element.iterchildren() to access all child elements.~~

That's not true, we obviously both misunderstood the Objectify API: In order to iterate over the paras in abstract, you need to iterate over root.abstract.para, not root.abstract itself. It's weird, because you intuitively think about abstract as a collection or a container for its nodes, and that container would be represented by a Python iterable. But it's actually the .para selector that represents the sequence.
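
A small sketch of that difference, using the XML from the question (the XML_STRING name is just for illustration, and lxml is assumed to be installed):

```python
from lxml import objectify

XML_STRING = """<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>"""

root = objectify.fromstring(XML_STRING)

# Iterating over root.abstract walks the <abstract> siblings, and there is only one.
abstract_tags = [el.tag for el in root.abstract]

# Iterating over root.abstract.para walks every <para> sibling.
paras = [p.text for p in root.abstract.para]

# The same selector also supports len() and indexing.
para_count = len(root.abstract.para)
second_para = root.abstract.para[1].text

# iterchildren() is the way to treat <abstract> as a container of its children.
children = [c.text for c in root.abstract.iterchildren()]

print(abstract_tags)
print(paras)
print(para_count, second_para)
print(children)
```

So root.abstract.para behaves like the first para when used as a single value, and like the sequence of all para siblings when iterated, measured, or indexed.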

Handling nested elements with Python lxml
