Let's suppose I have the following XML structure:
<?xml version="1.0" encoding="utf-8" ?>
<Document>
  <CstmrCdtTrfInitn>
    <GrpHdr>
      <other_tags>a</other_tags> <!--here there might be other nested tags inside <other_tags></other_tags>-->
      <other_tags>b</other_tags> <!--here there might be other nested tags inside <other_tags></other_tags>-->
      <other_tags>c</other_tags> <!--here there might be other nested tags inside <other_tags></other_tags>-->
    </GrpHdr>
    <PmtInf>
      <things>d</things> <!--here there might be other nested tags inside <things></things>-->
      <things>e</things> <!--here there might be other nested tags inside <things></things>-->
      <CdtTrfTxInf>
        <!-- other nested tags here -->
      </CdtTrfTxInf>
    </PmtInf>
    <PmtInf>
      <things>f</things> <!--here there might be other nested tags inside <things></things>-->
      <things>g</things> <!--here there might be other nested tags inside <things></things>-->
      <CdtTrfTxInf>
        <!-- other nested tags here -->
      </CdtTrfTxInf>
    </PmtInf>
    <PmtInf>
      <things>f</things> <!--here there might be other nested tags inside <things></things>-->
      <things>g</things> <!--here there might be other nested tags inside <things></things>-->
      <CdtTrfTxInf>
        <!-- other nested tags here -->
      </CdtTrfTxInf>
    </PmtInf>
  </CstmrCdtTrfInitn>
</Document>
Now, given this structure, I want to manipulate the sections as follows:
If there are two or more <PmtInf> tags that contain the same <things> elements, for example:
<things>d</things> <!--here there might be other nested tags inside <things></things>-->
<things>e</things> <!--here there might be other nested tags inside <things></things>-->
I would like to move the whole <CdtTrfTxInf></CdtTrfTxInf> into the first of those <PmtInf></PmtInf> tags and remove the whole <PmtInf></PmtInf> that I took the <CdtTrfTxInf></CdtTrfTxInf> from. A bit fuzzy, right? Here is an example of the expected result:
<Document>
  <CstmrCdtTrfInitn>
    <GrpHdr>
      <other_tags>a</other_tags> <!--here there might be other nested tags inside <other_tags></other_tags>-->
      <other_tags>b</other_tags> <!--here there might be other nested tags inside <other_tags></other_tags>-->
      <other_tags>c</other_tags> <!--here there might be other nested tags inside <other_tags></other_tags>-->
    </GrpHdr>
    <PmtInf>
      <things>d</things> <!--here there might be other nested tags inside <things></things>-->
      <things>e</things> <!--here there might be other nested tags inside <things></things>-->
      <CdtTrfTxInf>
        <!-- other nested tags here -->
      </CdtTrfTxInf>
    </PmtInf>
    <PmtInf>
      <things>f</things> <!--here there might be other nested tags inside <things></things>-->
      <things>g</things> <!--here there might be other nested tags inside <things></things>-->
      <CdtTrfTxInf>
        <!-- other nested tags here -->
      </CdtTrfTxInf>
      <CdtTrfTxInf>
        <!-- other nested tags here -->
      </CdtTrfTxInf>
    </PmtInf>
  </CstmrCdtTrfInitn>
</Document>
As you can see, the last two <PmtInf></PmtInf> tags have now become a single one (because their <things> elements matched), and the <CdtTrfTxInf></CdtTrfTxInf> from the removed one was moved into it.
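To pin down the rule described above, here is a minimal sketch using xml.etree.ElementTree from the standard library. It is only one possible reading of the requirement, not a definitive solution. The function names merge_pmtinf and _signature are made up for illustration, and the sketch assumes the tags carry no XML namespace (if the real files declare one, the tag names would need to be qualified) and that two <PmtInf> tags "match" when all of their children other than <CdtTrfTxInf> are identical, nested tags included. Note that ElementTree's default parser drops comments.

```python
import xml.etree.ElementTree as ET


def _signature(elem):
    """Hashable fingerprint of an element (hypothetical helper).

    Covers the tag, attributes, stripped text and, recursively, all children,
    so nested tags inside <things> take part in the comparison.
    """
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(_signature(child) for child in elem),
    )


def merge_pmtinf(xml_text):
    """Merge <PmtInf> blocks whose non-<CdtTrfTxInf> children are identical.

    Assumption: tags are not namespaced, as in the sample above.
    """
    root = ET.fromstring(xml_text)
    for parent in root.iter("CstmrCdtTrfInitn"):
        first_seen = {}  # signature of the <things> part -> first <PmtInf> with it
        # findall() returns a list, so removing from parent while looping is safe.
        for pmt in parent.findall("PmtInf"):
            key = tuple(
                _signature(child) for child in pmt if child.tag != "CdtTrfTxInf"
            )
            keeper = first_seen.setdefault(key, pmt)
            if keeper is pmt:
                continue  # first occurrence of this key: keep it as is
            # Duplicate key: move its transactions into the first matching block.
            for tx in pmt.findall("CdtTrfTxInf"):
                keeper.append(tx)
            parent.remove(pmt)
    return ET.tostring(root, encoding="unicode")
```

Called with the content of the first XML above, this should leave two <PmtInf> tags, the second holding both <CdtTrfTxInf> blocks. Whitespace and indentation in the result are not preserved exactly.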
Now, I would like to do this in any possible way (lxml, xml.etree, xslt, etc.). At first, I thought about using a regex, but that might get ugly. Then I thought I might be able to use some string manipulation, but I can't figure out how I would do that.
Can somebody tell me which method would be the most elegant / efficient one if the average size of an XML file is about 2k lines? An example would also be appreciated.
For the sake of completeness, I'll define a function which returns the entire XML content as a string:
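def get_xml_from(some_file):
    # Read the whole XML file and return its content as a string.
    with open(some_file) as xml_file:
        content = xml_file.read()
    return content


def modify_xml(some_file):
    content_of_xml = get_xml_from(some_file)
    # here I should be able to process the XML file
    processed_xml = content_of_xml  # placeholder until the processing is written
    return processed_xml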
I'm not looking for somebody to do this for me; I'm asking for ideas on the best ways of achieving this.