I have the following code:
from xml.etree import ElementTree
file_path = 'some_file_path'
document = ElementTree.parse(file_path, ElementTree.XMLParser(encoding='utf-8'))
When my XML looks like the following, parsing fails with the error: "xml.etree.ElementTree.ParseError: not well-formed"
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
In Sublime Text or Notepad++ I see highlighted characters such as ACK, DC4, or STX, which seem to be the culprit (one of them appears as a "-" in the second "text" node of the XML above). If I remove these characters, parsing works. What are these characters, and how can I fix this?
Running your code as follows works fine for me:
from xml.etree import ElementTree
from io import StringIO  # on Python 2 use: from StringIO import StringIO
xml_content = """<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1">
<textbox id="0">
<textline bbox="53.999,778.980,130.925,789.888">
<text font="GCCBBY+TT228t00" bbox="60.598,778.980,64.594,789.888" size="10.908">H</text>
<text font="GCCBBY+TT228t00" bbox="64.558,778.980,70.558,789.888" size="10.908">-</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""
print("parsing xml document")
# using StringIO to simulate reading from file
document = ElementTree.parse(StringIO(xml_content), ElementTree.XMLParser(encoding='utf-8'))
for elem in document.iter():
    print(elem.tag)
And the output is as expected:
parsing xml document
pages
page
textbox
textline
text
text
text
So the problem is most likely in how the file was copied and pasted from Notepad++; it may be adding special characters, so try another editor.
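If the file on disk really does contain control characters such as STX (0x02), ACK (0x06), or DC4 (0x14), switching editors will not help, because XML 1.0 does not allow those characters and the parser rejects them. One possible workaround is to strip them before parsing. The following is only a minimal sketch: the helper names `strip_illegal_xml_chars` and `parse_tolerant` are made up for illustration, and it assumes the file is UTF-8 or another ASCII-compatible encoding, so that control characters can be removed safely at the byte level.

```python
import re
from xml.etree import ElementTree

# C0 control bytes that XML 1.0 forbids; tab (0x09), LF (0x0A) and CR (0x0D)
# are legal and therefore kept.
_ILLEGAL_XML_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_illegal_xml_chars(data):
    """Return `data` (bytes) with the forbidden control bytes removed.

    Assumption: the input is UTF-8 or another ASCII-compatible encoding,
    where these byte values never occur inside a multi-byte character.
    """
    return _ILLEGAL_XML_BYTES.sub(b'', data)


def parse_tolerant(file_path):
    """Hypothetical helper: read the file, clean it, and return the root element."""
    with open(file_path, 'rb') as f:
        raw = f.read()
    return ElementTree.fromstring(strip_illegal_xml_chars(raw))
```

Note that this silently discards data. If those characters carry meaning in the source (for example, glyphs from a PDF extraction that were mapped to control codes), it may be better to fix the tool that produced the XML.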