Handling nested elements with Python lxml

2024/10/6 21:00:09

Given the simple XML data below:

<book>
  <title>My First Book</title>
  <abstract>
    <para>First paragraph of the abstract</para>
    <para>Second paragraph of the abstract</para>
  </abstract>
  <keywordSet>
    <keyword>First keyword</keyword>
    <keyword>Second keyword</keyword>
    <keyword>Third keyword</keyword>
  </keywordSet>
</book>

How can I traverse the tree, using lxml, and get all paragraphs in the "abstract" element, as well as all keywords in the "keywordSet" element?

The code snippet below returns only the first line of text in each element:

from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for line in root.abstract:
    print(line.para)  # prints only the first paragraph
for word in root.keywordSet:
    print(word.keyword)  # prints only the first keyword in the set

I tried to follow this example, but the code above doesn't work as expected.

On a different tack, it would be better still to read the entire XML tree into a Python dictionary, with each element tag as a key and the element's text (or texts) as the value. I found out that something like this might be possible using lxml objectify, but I couldn't figure out how to achieve it.

One really big problem I have been finding when attempting to write XML parsing code in Python is that most of the "examples" provided are just too simple and entirely fictitious to be of much help -- or else they are just the opposite, using too complicated automatically-generated XML data!

Could anybody give me a hint?

Thanks in advance!

EDIT: After posting this question, I found a simple solution here.

So, my updated code becomes:

from lxml import objectify

root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
print(root.title)  # prints the book title
for para in root.abstract.iterchildren():
    print(para)  # now prints the text of all paragraphs
for keyword in root.keywordSet.iterchildren():
    print(keyword)  # now prints all keywords in the set

Answer

This is pretty simple using XPath:

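from lxml import etree

tree = etree.parse('data.xml')  # data.xml holds the XML data from the question
# The paths start at the root element <book>; without it they match nothing.
paragraphs = tree.xpath('/book/abstract/para/text()')
keywords = tree.xpath('/book/keywordSet/keyword/text()')
print(paragraphs)
print(keywords)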

Output:

['First paragraph of the abstract', 'Second paragraph of the abstract']
['First keyword', 'Second keyword', 'Third keyword']

See the XPath Tutorial at W3Schools for details on the XPath syntax.

In particular, the expressions above use:

  • The / selector: a leading / starts at the document root, and each further / steps down to the immediate children.
  • The text() operator to select the text node (the "textual content") of the respective elements.
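To make the two selectors concrete, here is a small self-contained sketch. The xml_string literal is an assumption (the question's data, inlined so the snippet runs without a file), and the // form is an addition not used in the answer above: it matches elements at any depth instead of one level per step.

```python
from lxml import etree

# Assumption: the XML from the question, inlined so no data.xml file is needed.
xml_string = (
    "<book><title>My First Book</title>"
    "<abstract><para>First paragraph of the abstract</para>"
    "<para>Second paragraph of the abstract</para></abstract>"
    "<keywordSet><keyword>First keyword</keyword>"
    "<keyword>Second keyword</keyword>"
    "<keyword>Third keyword</keyword></keywordSet></book>"
)
root = etree.fromstring(xml_string)

# Absolute path: each "/" steps down exactly one level, starting at <book>.
paragraphs = root.xpath('/book/abstract/para/text()')

# "//" matches at any depth, so the parent elements need not be spelled out.
keywords = root.xpath('//keyword/text()')

# Without text(), the same path returns the elements instead of their text.
para_elements = root.xpath('/book/abstract/para')

print(paragraphs)
print(keywords)
print([el.tag for el in para_elements])
```

The // form is shorter but would also pick up keyword elements elsewhere in a larger document, so the explicit path is the safer choice when the structure is known.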

Here's how it could be done using the Objectify API:

from lxml import objectify

root = objectify.fromstring(xml_string)
paras = [p.text for p in root.abstract.para]
keywords = [k.text for k in root.keywordSet.keyword]
print(paras)
print(keywords)
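A quick way to check what these attribute lookups return is to compare them side by side. This is only a sketch: the xml_string literal is an assumption (the question's data, repeated so the snippet runs on its own), and the variable names are made up for illustration.

```python
from lxml import objectify

# Assumption: the XML from the question, inlined so the snippet is self-contained.
xml_string = (
    "<book><title>My First Book</title>"
    "<abstract><para>First paragraph of the abstract</para>"
    "<para>Second paragraph of the abstract</para></abstract>"
    "<keywordSet><keyword>First keyword</keyword>"
    "<keyword>Second keyword</keyword>"
    "<keyword>Third keyword</keyword></keywordSet></book>"
)
root = objectify.fromstring(xml_string)

# Iterating root.abstract walks the <abstract> siblings; there is only one.
abstract_count = len([a for a in root.abstract])

# root.abstract.para covers all <para> siblings: it has a length and an index.
para_count = len(root.abstract.para)
second_para = root.abstract.para[1].text

# iterchildren() walks every child of <abstract>, whatever its tag.
child_texts = [child.text for child in root.abstract.iterchildren()]

print(abstract_count, para_count)
print(second_para)
print(child_texts)
```

In this data every child of abstract is a para, so iterchildren() and root.abstract.para yield the same elements; they would differ as soon as abstract contained children with other tags.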

It seems that root.abstract.para is actually shorthand for root.abstract.para[0]. So you need to explicitly use element.iterchildren() to access all child elements.

That's not true; we obviously both misunderstood the Objectify API. To iterate over the paras in abstract, you need to iterate over root.abstract.para, not root.abstract itself. It's counterintuitive, because you naturally think of abstract as a collection or container for its child nodes, and expect that container to be represented by a Python iterable. But it is actually the .para selector that represents the sequence.
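As for the side question about reading the tree into a dictionary, one possible approach is a small recursive function. This is a sketch rather than part of the original answer: element_to_dict is a hypothetical helper name, the xml_string literal is the question's data inlined, and the function assumes the document has no attributes, comments, or mixed text worth keeping. Each tag maps to a list, because tags such as para can repeat.

```python
from collections import defaultdict

from lxml import etree

# Assumption: the XML from the question, inlined so the snippet is self-contained.
xml_string = (
    "<book><title>My First Book</title>"
    "<abstract><para>First paragraph of the abstract</para>"
    "<para>Second paragraph of the abstract</para></abstract>"
    "<keywordSet><keyword>First keyword</keyword>"
    "<keyword>Second keyword</keyword>"
    "<keyword>Third keyword</keyword></keywordSet></book>"
)


def element_to_dict(element):
    """Hypothetical helper: map each child tag to a list of texts or nested dicts."""
    result = defaultdict(list)
    for child in element:
        if len(child):
            # The child has children of its own, so recurse into it.
            result[child.tag].append(element_to_dict(child))
        else:
            # Leaf element: keep only its text.
            result[child.tag].append(child.text)
    return dict(result)


root = etree.fromstring(xml_string)
book = element_to_dict(root)

print(book['title'])
print(book['abstract'][0]['para'])
print(book['keywordSet'][0]['keyword'])
```

Wrapping every value in a list keeps the shape predictable at the cost of an extra [0] for elements that occur once, such as title.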
