Using Python's xml.etree to find element start and end character offsets

2024/10/5 9:30:54

I have XML data that looks like:

<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>

I would like to be able to extract:

  1. The XML elements as they're currently provided in etree.
  2. The full plain text of the document, between the start and end tags.
  3. The location within the plain text of each start element, as a character offset.

(3) is the most important requirement right now; etree provides (1) fine.

I cannot see any way to do (3) directly, but hoped that iterating through the elements in the document tree would return many small strings that could be reassembled, thus providing (2) and (3). However, requesting the .text of the root node only returns the text between the root's start tag and the first child element, e.g. "The capital of ".

Doing (1) with SAX could involve implementing a lot that's already been written many times over, in e.g. minidom and etree. Using lxml isn't an option for the package that this code is to go into. Can anybody help?

Answer

The iterparse() function is available in xml.etree.ElementTree:

# xml.etree.cElementTree was removed in Python 3.9; ElementTree is the same API
import xml.etree.ElementTree as etree

# `file` is a filename or an open file object
for event, elem in etree.iterparse(file, events=('start', 'end')):
    if event == 'start':
        print(elem.tag)  # use only tag name and attributes here
    elif event == 'end':
        # elem children elements, elem.text, elem.tail are available
        if elem.text is not None and elem.tail is not None:
            print(repr(elem.tail))
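For requirement (2), here is a minimal sketch, assuming the whole document fits in memory. Text that follows a child's end tag is stored on that child as .tail rather than on the parent, which is why root.text stops at the first child. itertext() yields every text and tail fragment in document order, so joining them gives the plain text. The sample document is the one from the question; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

root = ET.fromstring(text)

# root.text only covers the text before the first child element.
print(repr(root.text))

# Text after each child's end tag is stored on that child as .tail.
for child in root:
    print(child.tag, repr(child.text), repr(child.tail))

# itertext() walks all text and tail fragments in document order.
plain_text = "".join(root.itertext())
print(repr(plain_text))
```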

Another option is to override the start(), data(), and end() methods of etree.TreeBuilder:

from xml.etree.ElementTree import XMLParser, TreeBuilder

class MyTreeBuilder(TreeBuilder):
    def start(self, tag, attrs):
        print("<%s>" % tag)
        return TreeBuilder.start(self, tag, attrs)

    def data(self, data):
        print(repr(data))
        TreeBuilder.data(self, data)

    def end(self, tag):
        return TreeBuilder.end(self, tag)

text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

# ElementTree.fromstring()
parser = XMLParser(target=MyTreeBuilder())
parser.feed(text)
root = parser.close() # return an ordinary Element

Output

<xml>
'\nThe captial of '
<place>
'South Africa'
' is '
<place>
'Pretoria'
'.\n'
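
To get the character offsets asked for in (3), the same idea can be extended: count the characters passed to data() and note the running total whenever an element starts or ends. This is only a sketch. OffsetTreeBuilder and its chunks, length, spans and _starts attributes are names made up for this example, not part of xml.etree. The offsets refer to the parsed plain text (entities already resolved), not to positions in the raw XML source.

```python
from xml.etree.ElementTree import XMLParser, TreeBuilder


class OffsetTreeBuilder(TreeBuilder):
    """Hypothetical TreeBuilder subclass that records plain-text offsets."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # text fragments in document order
        self.length = 0    # number of plain-text characters seen so far
        self.spans = []    # (element, start, end), in the order elements close
        self._starts = []  # stack of start offsets for currently open elements

    def start(self, tag, attrs):
        # The element begins at the current plain-text position.
        self._starts.append(self.length)
        return super().start(tag, attrs)

    def data(self, data):
        self.chunks.append(data)
        self.length += len(data)
        return super().data(data)

    def end(self, tag):
        elem = super().end(tag)  # returns the element that was just closed
        self.spans.append((elem, self._starts.pop(), self.length))
        return elem


text = """<xml>
The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>"""

builder = OffsetTreeBuilder()
parser = XMLParser(target=builder)
parser.feed(text)
root = parser.close()  # an ordinary Element, as before

plain_text = "".join(builder.chunks)
for elem, start, end in builder.spans:
    print(elem.tag, elem.attrib, start, end, repr(plain_text[start:end]))
```

Each entry in builder.spans pairs an ordinary Element with its start and end offsets, so plain_text[start:end] is the text covered by that element. Nested elements close before their parents, so the root appears last.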