XML filtering with Python

2024/11/17 11:27:56

I have the following XML document:

<node0><node1><node2 a1="x1"> ... </node2><node2 a1="x2"> ... </node2><node2 a1="x1"> ... </node2></node1>
</node0>

I want to filter out node2 when a1="x2". The user provides the XPath and the attribute values that need to be tested and filtered out. I looked at some Python solutions such as BeautifulSoup, but they are too complicated and don't preserve the case of the text. I want to keep the document the same as before, with only the matching elements filtered out.

Can you recommend a simple and succinct solution? From the looks of it, this should not be too complicated. The actual XML document is not as simple as the one above, but the idea is the same.

Answer

This uses xml.etree.ElementTree which is in the standard library:

import xml.etree.ElementTree as xee
data='''\
<node1><node2 a1="x1"> ... </node2><node2 a1="x2"> ... </node2><node2 a1="x1"> ... </node2>
</node1>
'''
doc=xee.fromstring(data)
for tag in doc.findall('node2'):
    if tag.attrib['a1']=='x2':
        doc.remove(tag)
print(xee.tostring(doc))
# <node1>
#   <node2 a1="x1"> ... </node2>
#   <node2 a1="x1"> ... </node2>
# </node1>
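
One caveat with the loop above: tag.attrib['a1'] raises KeyError for any node2 that has no a1 attribute. A minimal sketch, assuming some elements in the real document may omit the attribute, uses Element.get instead. The helper name drop_children and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

def drop_children(parent, tag, attr, value):
    """Remove direct children of `parent` named `tag` whose `attr` equals `value`."""
    # findall returns a list, so removing from `parent` inside the loop is safe.
    for child in parent.findall(tag):
        # get() returns None when the attribute is absent instead of raising KeyError.
        if child.get(attr) == value:
            parent.remove(child)

# Hypothetical sample: the last node2 has no a1 attribute at all.
data = '<node1><node2 a1="x1">A</node2><node2 a1="x2">B</node2><node2>C</node2></node1>'
doc = ET.fromstring(data)
drop_children(doc, 'node2', 'a1', 'x2')

# encoding='unicode' makes tostring return a str rather than bytes.
result = ET.tostring(doc, encoding='unicode')
print(result)
```

Tag and attribute names keep their original case, since ElementTree treats XML as case-sensitive.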

This uses lxml, which is not in the standard library, but has a more powerful syntax:

import lxml.etree
data='''\
<node1><node2 a1="x1"> ... </node2><node2 a1="x2"> ... </node2><node2 a1="x1"> ... </node2>
</node1>
'''
doc = lxml.etree.XML(data)
e=doc.find('node2[@a1="x2"]')
doc.remove(e)
print(lxml.etree.tostring(doc))
# <node1>
#   <node2 a1="x1"> ... </node2>
#   <node2 a1="x1"> ... </node2>
# </node1>

Edit: If node2 is buried more deeply in the XML, you can iterate through all the tags, check each parent tag to see whether a node2 element is one of its children, and then remove it if so:

Using only xml.etree.ElementTree:

doc=xee.fromstring(data)
# Visit every element as a potential parent of node2
for parent in doc.iter():
    for child in parent.findall('node2'):
        if child.attrib['a1']=='x2':
            parent.remove(child)

Using lxml:

doc = lxml.etree.XML(data)
for parent in doc.iter('*'):
    # findall removes every matching child, not just the first one per parent
    for child in parent.findall('node2[@a1="x2"]'):
        parent.remove(child)
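
Since the question says the path and the attribute values come from the user, the loops above can be wrapped in one function. The following is only a sketch using the standard library: the name filter_xml and its parameters are hypothetical, and path is assumed to be written in the limited XPath subset that ElementTree's findall accepts (for example './/node2' or 'node1/node2').

```python
import xml.etree.ElementTree as ET

def filter_xml(xml_text, path, attr, value):
    """Return xml_text with elements matching `path` and attr == value removed."""
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.findall(path):
        parent = parent_of.get(elem)
        # The root has no parent and cannot be removed; skip it if the path matches it.
        if parent is not None and elem.get(attr) == value:
            parent.remove(elem)
    return ET.tostring(root, encoding='unicode')

# Sample shaped like the document in the question, with short text in place of "...".
data = ('<node0><node1><node2 a1="x1">A</node2><node2 a1="x2">B</node2>'
        '<node2 a1="x1">C</node2></node1></node0>')
filtered = filter_xml(data, './/node2', 'a1', 'x2')
print(filtered)
```

Note that removing an element from an ElementTree tree also drops its tail, that is, any text between the element's closing tag and the next tag. If that text matters, it has to be copied onto the previous sibling or the parent before calling remove.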