Convert nested XML content into CSV using xml tree in python

2024/11/13 10:24:37

I'm very new to python and please treat me as same. When i tried to convert the XML content into List of Dictionaries I'm getting output but not as expected and tried a lot playing around.

XML Content

<project>
<data><row><respondent>m0wxo5f6w42h3fot34m7s6xij</respondent><timestamp>10-06-16 11:30</timestamp><product>1</product><replica>1</replica><seqnr>1</seqnr><session>1</session><column><question>Q1</question><answer>a1</answer></column><column><question>Q2</question><answer>a2</answer></column></row>
<row><respondent>w42h3fot34m7s6x</respondent><timestamp>10-06-16 11:30</timestamp><product>1</product><replica>1</replica><seqnr>1</seqnr><session>1</session><column><question>Q3</question><answer>a3</answer></column><column><question>Q4</question><answer>a4</answer></column><column><question>Q5</question><answer>a5</answer></column></row>
</data>
</project>

Code i have used:

import xml.etree.ElementTree as ETtree = ET.parse(xml_file.xml)   # import xml from
root = tree.getroot()  
data_list = []for item in root.find('./data'):    # find all projects nodedata = {}              # dictionary to store content of each projectsfor child in item:data[child.tag] = child.text   # add item to dictionary#-----------------for loop with subchild is not working as expcted in my casefor subchild in child:data[subchild.tag] = subchild.textdata_list.append(data)
print(data_list)headers = {k for d in data_list for k in d.keys()} # headers for csv 
with open(csv_file,'w') as f:writer = csv.DictWriter(f, fieldnames = headers)    # creating a DictWriter objectwriter.writeheader()    # write headers to csvwriter.writerows(data_list)

Output for the data_list is getting the last info of question to the list of dictionaries. i guess the issue is at subchild forloop but im not understanding how to append the list with dictionaries.

[{'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','column': '\n  ,'question': 'Q2','answer': 'a2'
},
{
'respondent': 'w42h3fot34m7s6x','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','column': '\n ,'question': 'Q2','answer': 'a2'
}.......
]

I expect the below output, tried a lot but unable to loop over the column tag.

[{'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','question': 'Q1','answer': 'a1'},{'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','question': 'Q2','answer': 'a2'},{'respondent': 'w42h3fot34m7s6x','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','question': 'Q3','answer': 'a3'},{'respondent': 'w42h3fot34m7s6x','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','question': 'Q4','answer': 'a4'},{'respondent': 'w42h3fot34m7s6x','timestamp': '10-06-16 11:30','product': '1','replica': '1','seqnr': '1','session': '1','question': 'Q5','answer': 'a5'}
]

I have refereed so many stack overflow questions on xml tree but still didn't helped me.

any help/suggestion is appreciated.

Answer

I had a problem understanding what this code is supposed to do because it uses abstract variable names like item, child, subchild and this makes it hard to reason about the code. I'm not as clever as that, so I renamed the variables to row, tag, and column to make it easier for me to see what the code is doing. (In my book, even row and column are a bit abstract, but I suppose the opacity of the XML input is hardly your fault.)

You have 2 rows but you want 5 dictionaries, because you have 5 <column> tags and you want each <column>'s data in a separate dictionary. But you want the other tags in the <row> to be repeated along with each <column>'s data.

That means you need to build a dictionary for every <row>, then, for each <column>, add that column's data to the dictionary, then output it before going on to the next column.

This code makes the simplifying assumption that all of your <columns>s have the same structure, with exactly one <question> and exactly one <answer> and nothing else. If this assumption does not hold then a <column> may get reported with stale data it inherited from the previous <column> in the same row. It will also produce no output at all for any <row> that does not have at least one <column>.

The code has to loop through the tags twice, once for the non-<column>s and once for the <column>s. Otherwise it can't be sure it has seen all the non-<column> tags before it starts outputting the <column>s.

There are other (no doubt more elegant) ways to do this, but I kept the code structure as close to your original as I could, other than making the variable names less opaque.

for row in root.find('./data'):    # find all projects nodedata = {}              # dictionary to store content of each projectsfor tag in row:if tag.tag != "column":data[tag.tag] = tag.text   # add row to dictionary# Now the dictionary data is built for the row levelfor tag in row:if tag.tag == "column":for column in tag:data[column.tag] = column.text# Now we have added the column level data for one column tagdata_list.append(data.copy())

Output is as below. The key order of the dicts isn't preserved because I used pprint.pprint for convenience.

[{'answer': 'a1','product': '1','question': 'Q1','replica': '1','respondent': 'm0wxo5f6w42h3fot34m7s6xij','seqnr': '1','session': '1','timestamp': '10-06-16 11:30'},{'answer': 'a2','product': '1','question': 'Q2','replica': '1','respondent': 'm0wxo5f6w42h3fot34m7s6xij','seqnr': '1','session': '1','timestamp': '10-06-16 11:30'},{'answer': 'a3','product': '1','question': 'Q3','replica': '1','respondent': 'w42h3fot34m7s6x','seqnr': '1','session': '1','timestamp': '10-06-16 11:30'},{'answer': 'a4','product': '1','question': 'Q4','replica': '1','respondent': 'w42h3fot34m7s6x','seqnr': '1','session': '1','timestamp': '10-06-16 11:30'},{'answer': 'a5','product': '1','question': 'Q5','replica': '1','respondent': 'w42h3fot34m7s6x','seqnr': '1','session': '1','timestamp': '10-06-16 11:30'}]
https://en.xdnf.cn/q/119492.html

Related Q&A

How to decode binary file with for index, line in enumerate(file)?

I am opening up an extremely large binary file I am opening in Python 3.5 in file1.py:with open(pathname, rb) as file:for i, line in enumerate(file):# parsing hereHowever, I naturally get an error beca…

how to install pyshpgeocode from git [duplicate]

This question already has answers here:The unauthenticated git protocol on port 9418 is no longer supported(10 answers)Closed 2 years ago.I would like to install the following from Git https://github.c…

How to export dictionary as CSV using Python?

I am having problems exporting certain items in a dictionary to CSV. I can export name but not images (the image URL).This is an example of part of my dictionary: new = [{ "name" : "pete…

Passing values to a function from within a function in python

I need to pass values from one function to the next from within the function.For example (my IRC bot programmed to respond to commands in the channel):def check_perms(nick,chan,cmd):sql = "SELECT …

How to make Stop button to terminate start function already running in Tkinter (Python)

I am making a GUI using Tkinter with two main buttons: "Start" and "Stop". Could you, please, advise on how to make the "Stop" button to terminate the already running func…

adding language to markdown codeblock in bulk

My Problem is to add to every single block of code a language in my markdown files. Ive hundreds of files in nested directories. The files have this form: ```language a ```Normal text``` b ```Normal te…

Cant randomize list with classes inside of it Python 2.7.4

I am new to coding and I need some help. Im trying to randomize these rooms or scenes in a text adventure but whenever I try to randomize it they dont even show up when I run it! Here is the script:fro…

calculate the queue for orders based on creation and delivery date, by product group

I have a Pandas dataframe containing records for a lot of orders, one recorde for each order. Each record has order_id, category_id, created_at and picked_at. I need to calculate queue length for each …

Python print with string invalid syntax

I have a rock, paper, scissors code Ive been working on lately (yes, I am a total noob at coding), and I get an Invalid Syntax error with this specific line:print(The magical 8ball reads "Your for…

How to load images and text labels for CNN regression from different folders

I have two folders, X_train and Y_train. X_train is images, Y_train is vector and .txt files. I try to train CNN for regression. I could not figure out how to take data and train the network. When i us…