Read hierarchical (tree-like) XML into a pandas dataframe, preserving hierarchy

2024/11/14 15:27:22

I have a XML document that contains a hierarchical, tree-like structure, see the example below.

The document contains several <Message> tags (I only copied one of them for convenience).

Each <Message> has some associated data (id, status, priority) on its own.

Besides, each <Message> can contain one or more <Street> children which again have some relevant data (<name>, <length>).

Moreover, each <Street> can have one or more <Link> children which again have their own relevant data (<id>, <direction>).

Example XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'><status>Active</status><priority>Low</priority><Area><Streets><Street><name>King Street</name><length>Short</length><Link><id>75838745</id><direction>North</direction></Link><Link><id>168745</id><direction>South</direction></Link><Link><id>975416</id><direction>North</direction></Link></Street><Street><name>Queen Street</name><length>Long</length><Link><id>366248</id><direction>West</direction></Link><Link><id>745812</id><direction>East</direction></Link></Street></Streets></Area>
</Message>
</Messages>
</Root>

Parsing the XML with Python and storing the relevant data in variables is not the problem - I can use for example the lxml library and either read the whole document, then perform some xpath expressions to get the relevant fields, or read it line by line with the iterparse method.

However, I would like to put the data into a pandas dataframe while preserving the hierarchy in it. The goal is to query for single messages (e.g. by Boolean expressions like if status == Active then get the Message with all its streets and its streets' links) and get all the data that belongs to the specific message (its streets and its streets' links). How would this best be done?

I tried different approaches but ran into problems with all of them.

If I create one dataframe row for each XML row that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID], I get an Index with lots of NaN in it (which is generally discouraged) because MessageID does not know its children streets and links yet. Besides, I would not know how to select some sub-dataset by Boolean condition instead of only getting some single rows without its children.

When doing a GroupBy on [MessageID, StreetName, LinkID], I do not know how to get back a (probably MultiIndex) dataframe from the pandas GroupBy object since there is nothing to aggregate here (no mean/std/sum/whatsoever, the values should stay the same).

Any suggestions how this could be handled efficiently?

Answer

I finally managed to solve the problem as described above and this is how.

I extended the above given XML document to include two messages instead of one. This is how it looks as a valid Python string (it could also be loaded from a file of course):

xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \<status>Active</status> \<priority>Low</priority> \<Area> \<Streets> \<Street> \<name>King Street</name> \<length>Short</length> \<Link> \<id>75838745</id> \<direction>North</direction> \</Link> \<Link> \<id>168745</id> \<direction>South</direction> \</Link> \<Link> \<id>975416</id> \<direction>North</direction> \</Link> \</Street> \<Street> \<name>Queen Street</name> \<length>Long</length> \<Link> \<id>366248</id> \<direction>West</direction> \</Link> \<Link> \<id>745812</id> \<direction>East</direction> \</Link> \</Street> \</Streets> \</Area> \
</Message> \
<Message id='54321'> \<status>Inactive</status> \<priority>High</priority> \<Area> \<Streets> \<Street> \<name>Princess Street</name> \<length>Mid</length> \<Link> \<id>744154</id> \<direction>West</direction> \</Link> \<Link> \<id>632214</id> \<direction>South</direction> \</Link> \<Link> \<id>654785</id> \<direction>East</direction> \</Link> \</Street> \<Street> \<name>Prince Street</name> \<length>Very Long</length> \<Link> \<id>1022444</id> \<direction>North</direction> \</Link> \<Link> \<id>4474558</id> \<direction>South</direction> \</Link> \</Street> \</Streets> \</Area> \
</Message> \
</Messages> \
</Root>'''

To parse the hierarchical XML structure into a flat pandas dataframe, I used Python's ElementTree iterparse method which provides a SAX-like interface to iterate through a XML document line by line and fire events if specific XML tags start or end.

To each parsed XML line, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that somehow belongs together (message, street, link) and that is to be stored in its own dataframe row later on. When all information to one such row is collected, the dictionary is appended to a list storing all rows in their appropriate order.

This is what the XML parsing looks like (see inline comments for further explanation):

# imports
import xml.etree.ElementTree as ET
import pandas as pd# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}# initialize list that stores the single dataframe rows
listOfRows = []# read the xml file line by line and throw signal when specific tags start or end
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):########### get all information on the current message and store in the appropriate dictionary########### get current message's id attributeif event == 'start' and element.tag == 'Message':messageRow = {} # re-initialize the dictionary for the current rowmessageRow['messageId'] = element.get('id')# get current message's statusif event == 'end' and element.tag == 'status':messageRow['status'] = element.text# get current message's priorityif event == 'end' and element.tag == 'priority':messageRow['priority'] = element.text# when no more information on the current message is expected, append it to the list of rowsif event == 'end' and element.tag == 'priority':listOfRows.append(messageRow)########### get all information on the current street and store in row dictionary##########if event == 'end' and element.tag == 'name':streetRow = {} # re-initialize the dictionary for the current street rowstreetRow['streetName'] = element.textif event == 'end' and element.tag == 'length':streetRow['streetLength'] = element.text# when no more information on the current street is expected, append it to the list of rowsif event == 'end' and element.tag == 'length':# link the street to the message it belongs to, then appendstreetRow['messageId'] = messageRow['messageId']listOfRows.append(streetRow)########### get all information on the current link and store in row dictionary##########if event == 'end' and element.tag == 'id':linkRow = {} # re-initialize the dictionary for the current link rowlinkRow['linkId'] = element.textif event == 'end' and element.tag == 'direction':linkRow['direction'] = element.text# when no more information on the current link is expected, append it to the list of rowsif event == 'end' and element.tag == 'direction':# link the link to the message it belongs to, then appendlinkRow['messageId'] = messageRow['messageId']listOfRows.append(linkRow)

listOfRows is now a list of dictionaries where each dictionary stores the information that is to be put into one dataframe row. Creating a dataframe with this list as datasource can be done with

# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)

and gives the "raw" dataframe:

   messageId    status priority       streetName streetLength    linkId  \
0      12345    Active      Low              NaN          NaN       NaN   
1      12345       NaN      NaN      King Street        Short       NaN   
2      12345       NaN      NaN              NaN          NaN  75838745   
3      12345       NaN      NaN              NaN          NaN    168745   
4      12345       NaN      NaN              NaN          NaN    975416   
5      12345       NaN      NaN     Queen Street         Long       NaN   
6      12345       NaN      NaN              NaN          NaN    366248   
7      12345       NaN      NaN              NaN          NaN    745812   
8      54321  Inactive     High              NaN          NaN       NaN   
9      54321       NaN      NaN  Princess Street          Mid       NaN   
10     54321       NaN      NaN              NaN          NaN    744154   
11     54321       NaN      NaN              NaN          NaN    632214   
12     54321       NaN      NaN              NaN          NaN    654785   
13     54321       NaN      NaN    Prince Street    Very Long       NaN   
14     54321       NaN      NaN              NaN          NaN   1022444   
15     54321       NaN      NaN              NaN          NaN   4474558   direction  
0        NaN  
1        NaN  
2      North  
3      South  
4      North  
5        NaN  
6       West  
7       East  
8        NaN  
9        NaN  
10      West  
11     South  
12      East  
13       NaN  
14     North  
15     South  

We can now se the columns of interest (messageId, streetName, linkId) as MultiIndex on that dataframe:

# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)

which gives:

                                      status priority streetLength direction
messageId streetName      linkId                                            
12345     NaN             NaN         Active      Low          NaN       NaNKing Street     NaN            NaN      NaN        Short       NaNNaN             75838745       NaN      NaN          NaN     North168745         NaN      NaN          NaN     South975416         NaN      NaN          NaN     NorthQueen Street    NaN            NaN      NaN         Long       NaNNaN             366248         NaN      NaN          NaN      West745812         NaN      NaN          NaN      East
54321     NaN             NaN       Inactive     High          NaN       NaNPrincess Street NaN            NaN      NaN          Mid       NaNNaN             744154         NaN      NaN          NaN      West632214         NaN      NaN          NaN     South654785         NaN      NaN          NaN      EastPrince Street   NaN            NaN      NaN    Very Long       NaNNaN             1022444        NaN      NaN          NaN     North4474558        NaN      NaN          NaN     South

Even though having NaN in an index should be disregarded in general, I don't see any problem with it for this usecase.

Finally, to get the desired effect of accessing single messages by their messageId, including all of its "children" streets and links, the MultiIndexed dataframe has to be grouped by the most outer index level:

# group by the most outer index
groups = df.groupby(level='messageId')

Now, you can for example loop over all messages (and do whatever with them) with

# iterate over all groups
for key, group in groups:print('key: ' + key)print('group:')print(group)print('\n')

which returns

key: 12345
group:status priority streetLength direction
messageId streetName   linkId                                          
12345     NaN          NaN       Active      Low          NaN       NaNKing Street  NaN          NaN      NaN        Short       NaNNaN          75838745     NaN      NaN          NaN     North168745       NaN      NaN          NaN     South975416       NaN      NaN          NaN     NorthQueen Street NaN          NaN      NaN         Long       NaNNaN          366248       NaN      NaN          NaN      West745812       NaN      NaN          NaN      Eastkey: 54321
group:status priority streetLength direction
messageId streetName      linkId                                           
54321     NaN             NaN      Inactive     High          NaN       NaNPrincess Street NaN           NaN      NaN          Mid       NaNNaN             744154        NaN      NaN          NaN      West632214        NaN      NaN          NaN     South654785        NaN      NaN          NaN      EastPrince Street   NaN           NaN      NaN    Very Long       NaNNaN             1022444       NaN      NaN          NaN     North4474558       NaN      NaN          NaN     South

or you can access specific messages by the messageId, returning the row containing the messageId and also all of its dedicated streets and links:

# get groups by key
print('specific group only:')
print(groups.get_group('54321'))

gives

specific group only:status priority streetLength direction
messageId streetName      linkId                                           
54321     NaN             NaN      Inactive     High          NaN       NaNPrincess Street NaN           NaN      NaN          Mid       NaNNaN             744154        NaN      NaN          NaN      West632214        NaN      NaN          NaN     South654785        NaN      NaN          NaN      EastPrince Street   NaN           NaN      NaN    Very Long       NaNNaN             1022444       NaN      NaN          NaN     North4474558       NaN      NaN          NaN     South

Hope this will be helpful for somebody sometime.

https://en.xdnf.cn/q/72030.html

Related Q&A

Reference counting using PyDict_SetItemString

Im wondering how memory management/reference counting works when a new value is set into an existing field within a PyDict (within a C extension).For instance, assume a dictionary is created and popula…

How to use statsmodels.tsa.seasonal.seasonal_decompose with a pandas dataframe

from statsmodels.tsa.seasonal import seasonal_decomposedef seasonal_decomp(df, model="additive"):seasonal_df = Noneseasonal_df = seasonal_decompose(df, model=additive)return seasonal_dfseason…

UnidentifiedImageError: cannot identify image file

Hello I am training a model with TensorFlow and Keras, and the dataset was downloaded from https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765 This is a zip folder that I split in the …

Pydub raw audio data

Im using Pydub in Python 3.4 to try to detect the pitch of some audio files.I have a working pitch detection algorithm (McLeod Pitch Method), which is robust for real-time applications (I even made an …

Create Duplicate Rows and Change Values in Specific Columns

How to create x amount of duplicates based on a row in the dataframe and change a single or multi variables from specific columns. The rows are then added to the end of the same dataframe.A B C D E F 0…

writing and saving CSV file from scraping data using python and Beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the Name of the golf course ,Address ,Ownership ,We…

Performance issue turning rows with start - end into a dataframe with TimeIndex

I have a large dataset where each line represents the value of a certain type (think a sensor) for a time interval (between start and end). It looks like this: start end type value 2015-01-01…

How can I create a key using RSA/ECB/PKCS1Padding in python?

I am struggling to find any method of using RSA in ECB mode with PKCS1 padding in python. Ive looked into pyCrypto, but they dont have PKCS1 padding in the master branch (but do in a patch). Neverthel…

Do full-outer-join with pandas.merge_asof

Hi I need to align some time series data with nearest timestamps, so I think pandas.merge_asof could be a good candidate. However, it does not have an option to set how=outer like in the standard merge…

order of calling constructors in Python

#!/usr/bin/pythonclass Parent(object): # define parent classparentAttr = 100def __init__(self):print "Calling parent constructor"def parentMethod(self):print Calling parent methoddef s…