I have a XML document that contains a hierarchical, tree-like structure, see the example below.
The document contains several <Message>
tags (I only copied one of them for convenience).
Each <Message>
has some associated data (id
, status
, priority
) on its own.
Besides, each <Message>
can contain one or more <Street>
children which again have some relevant data (<name>
, <length>
).
Moreover, each <Street>
can have one or more <Link>
children which again have their own relevant data (<id>
, <direction>
).
Example XML document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'><status>Active</status><priority>Low</priority><Area><Streets><Street><name>King Street</name><length>Short</length><Link><id>75838745</id><direction>North</direction></Link><Link><id>168745</id><direction>South</direction></Link><Link><id>975416</id><direction>North</direction></Link></Street><Street><name>Queen Street</name><length>Long</length><Link><id>366248</id><direction>West</direction></Link><Link><id>745812</id><direction>East</direction></Link></Street></Streets></Area>
</Message>
</Messages>
</Root>
Parsing the XML with Python and storing the relevant data in variables is not the problem - I can use for example the lxml
library and either read the whole document, then perform some xpath
expressions to get the relevant fields, or read it line by line with the iterparse
method.
However, I would like to put the data into a pandas dataframe while preserving the hierarchy in it. The goal is to query for single messages (e.g. by Boolean expressions like if status == Active then get the Message with all its streets and its streets' links
) and get all the data that belongs to the specific message (its streets and its streets' links). How would this best be done?
I tried different approaches but ran into problems with all of them.
If I create one dataframe row for each XML row that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID]
, I get an Index with lots of NaN
in it (which is generally discouraged) because MessageID
does not know its children streets
and links
yet. Besides, I would not know how to select some sub-dataset by Boolean condition instead of only getting some single rows without its children.
When doing a GroupBy on [MessageID, StreetName, LinkID]
, I do not know how to get back a (probably MultiIndex) dataframe from the pandas GroupBy object since there is nothing to aggregate here (no mean/std/sum/whatsoever, the values should stay the same).
Any suggestions how this could be handled efficiently?
I finally managed to solve the problem as described above and this is how.
I extended the above given XML document to include two messages instead of one. This is how it looks as a valid Python string (it could also be loaded from a file of course):
xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \<status>Active</status> \<priority>Low</priority> \<Area> \<Streets> \<Street> \<name>King Street</name> \<length>Short</length> \<Link> \<id>75838745</id> \<direction>North</direction> \</Link> \<Link> \<id>168745</id> \<direction>South</direction> \</Link> \<Link> \<id>975416</id> \<direction>North</direction> \</Link> \</Street> \<Street> \<name>Queen Street</name> \<length>Long</length> \<Link> \<id>366248</id> \<direction>West</direction> \</Link> \<Link> \<id>745812</id> \<direction>East</direction> \</Link> \</Street> \</Streets> \</Area> \
</Message> \
<Message id='54321'> \<status>Inactive</status> \<priority>High</priority> \<Area> \<Streets> \<Street> \<name>Princess Street</name> \<length>Mid</length> \<Link> \<id>744154</id> \<direction>West</direction> \</Link> \<Link> \<id>632214</id> \<direction>South</direction> \</Link> \<Link> \<id>654785</id> \<direction>East</direction> \</Link> \</Street> \<Street> \<name>Prince Street</name> \<length>Very Long</length> \<Link> \<id>1022444</id> \<direction>North</direction> \</Link> \<Link> \<id>4474558</id> \<direction>South</direction> \</Link> \</Street> \</Streets> \</Area> \
</Message> \
</Messages> \
</Root>'''
To parse the hierarchical XML structure into a flat pandas dataframe, I used Python's ElementTree iterparse
method which provides a SAX-like interface to iterate through a XML document line by line and fire events if specific XML tags start or end.
To each parsed XML line, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that somehow belongs together (message, street, link) and that is to be stored in its own dataframe row later on. When all information to one such row is collected, the dictionary is appended to a list storing all rows in their appropriate order.
This is what the XML parsing looks like (see inline comments for further explanation):
# imports
import xml.etree.ElementTree as ET
import pandas as pd# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}# initialize list that stores the single dataframe rows
listOfRows = []# read the xml file line by line and throw signal when specific tags start or end
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):########### get all information on the current message and store in the appropriate dictionary########### get current message's id attributeif event == 'start' and element.tag == 'Message':messageRow = {} # re-initialize the dictionary for the current rowmessageRow['messageId'] = element.get('id')# get current message's statusif event == 'end' and element.tag == 'status':messageRow['status'] = element.text# get current message's priorityif event == 'end' and element.tag == 'priority':messageRow['priority'] = element.text# when no more information on the current message is expected, append it to the list of rowsif event == 'end' and element.tag == 'priority':listOfRows.append(messageRow)########### get all information on the current street and store in row dictionary##########if event == 'end' and element.tag == 'name':streetRow = {} # re-initialize the dictionary for the current street rowstreetRow['streetName'] = element.textif event == 'end' and element.tag == 'length':streetRow['streetLength'] = element.text# when no more information on the current street is expected, append it to the list of rowsif event == 'end' and element.tag == 'length':# link the street to the message it belongs to, then appendstreetRow['messageId'] = messageRow['messageId']listOfRows.append(streetRow)########### get all information on the current link and store in row dictionary##########if event == 'end' and element.tag == 'id':linkRow = {} # re-initialize the dictionary for the current link rowlinkRow['linkId'] = element.textif event == 'end' and element.tag == 'direction':linkRow['direction'] = element.text# when no more information on the current link is expected, append it to the list of rowsif event == 'end' and element.tag == 'direction':# link the link to the message it belongs to, then appendlinkRow['messageId'] = messageRow['messageId']listOfRows.append(linkRow)
listOfRows
is now a list of dictionaries where each dictionary stores the information that is to be put into one dataframe row. Creating a dataframe with this list as datasource can be done with
# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)
and gives the "raw" dataframe:
messageId status priority streetName streetLength linkId \
0 12345 Active Low NaN NaN NaN
1 12345 NaN NaN King Street Short NaN
2 12345 NaN NaN NaN NaN 75838745
3 12345 NaN NaN NaN NaN 168745
4 12345 NaN NaN NaN NaN 975416
5 12345 NaN NaN Queen Street Long NaN
6 12345 NaN NaN NaN NaN 366248
7 12345 NaN NaN NaN NaN 745812
8 54321 Inactive High NaN NaN NaN
9 54321 NaN NaN Princess Street Mid NaN
10 54321 NaN NaN NaN NaN 744154
11 54321 NaN NaN NaN NaN 632214
12 54321 NaN NaN NaN NaN 654785
13 54321 NaN NaN Prince Street Very Long NaN
14 54321 NaN NaN NaN NaN 1022444
15 54321 NaN NaN NaN NaN 4474558 direction
0 NaN
1 NaN
2 North
3 South
4 North
5 NaN
6 West
7 East
8 NaN
9 NaN
10 West
11 South
12 East
13 NaN
14 North
15 South
We can now se the columns of interest (messageId, streetName, linkId) as MultiIndex on that dataframe:
# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)
which gives:
status priority streetLength direction
messageId streetName linkId
12345 NaN NaN Active Low NaN NaNKing Street NaN NaN NaN Short NaNNaN 75838745 NaN NaN NaN North168745 NaN NaN NaN South975416 NaN NaN NaN NorthQueen Street NaN NaN NaN Long NaNNaN 366248 NaN NaN NaN West745812 NaN NaN NaN East
54321 NaN NaN Inactive High NaN NaNPrincess Street NaN NaN NaN Mid NaNNaN 744154 NaN NaN NaN West632214 NaN NaN NaN South654785 NaN NaN NaN EastPrince Street NaN NaN NaN Very Long NaNNaN 1022444 NaN NaN NaN North4474558 NaN NaN NaN South
Even though having NaN
in an index should be disregarded in general, I don't see any problem with it for this usecase.
Finally, to get the desired effect of accessing single messages by their messageId
, including all of its "children" streets and links, the MultiIndexed dataframe has to be grouped by the most outer index level:
# group by the most outer index
groups = df.groupby(level='messageId')
Now, you can for example loop over all messages (and do whatever with them) with
# iterate over all groups
for key, group in groups:print('key: ' + key)print('group:')print(group)print('\n')
which returns
key: 12345
group:status priority streetLength direction
messageId streetName linkId
12345 NaN NaN Active Low NaN NaNKing Street NaN NaN NaN Short NaNNaN 75838745 NaN NaN NaN North168745 NaN NaN NaN South975416 NaN NaN NaN NorthQueen Street NaN NaN NaN Long NaNNaN 366248 NaN NaN NaN West745812 NaN NaN NaN Eastkey: 54321
group:status priority streetLength direction
messageId streetName linkId
54321 NaN NaN Inactive High NaN NaNPrincess Street NaN NaN NaN Mid NaNNaN 744154 NaN NaN NaN West632214 NaN NaN NaN South654785 NaN NaN NaN EastPrince Street NaN NaN NaN Very Long NaNNaN 1022444 NaN NaN NaN North4474558 NaN NaN NaN South
or you can access specific messages by the messageId, returning the row containing the messageId and also all of its dedicated streets and links:
# get groups by key
print('specific group only:')
print(groups.get_group('54321'))
gives
specific group only:status priority streetLength direction
messageId streetName linkId
54321 NaN NaN Inactive High NaN NaNPrincess Street NaN NaN NaN Mid NaNNaN 744154 NaN NaN NaN West632214 NaN NaN NaN South654785 NaN NaN NaN EastPrince Street NaN NaN NaN Very Long NaNNaN 1022444 NaN NaN NaN North4474558 NaN NaN NaN South
Hope this will be helpful for somebody sometime.