python-2.7 lxml elementtree c14n

Extract a portion of an XML document without using tostring in Python


Say I have an XML document like this one:

<a>
 <na:Data xmlns:na="http://some_site.com#" Ref="http://another_site.com" 
  Key="value">
  <b>
  <c>some_c_attrib</c>
  <d>some_d_attrib</d>
  <e>some_e_attrib</e>
   <f>some_f_attrib</f>
   <g>some_g_attrib</g>
  </b>
  <h>
   <i>some_i_attrib</i>
   <j>some_j_attrib</j>
  </h>
 </na:Data>
 <da:Newtag xmlns:da="http://new_site.com">
  <k name="http://new_new_site.com"/>

The file continues for a few more lines after this. I parsed the XML with ET.parse(FILENAME) and wrote it to a new file with write_c14n("new.xml"). I now want to extract part of new.xml into another XML file: only the portion starting at <na:Data xmlns:na="http://some_site.com#" Ref="http://another_site.com" Key="value"> and ending at </h>.

However, I do not want to use tostring(), because it does not preserve the canonical form produced by write_c14n(). I wondered whether simply copying that portion of new.xml as text into another file would work, but that seems to add extra newlines and does not keep the XML formatting intact.

Here is what I have tried.

I attempted to build a new tree whose root is <na:Data xmlns:na="http://some_site.com#" Ref="http://another_site.com" Key="value">:

from lxml import etree

tree = etree.parse('file_location/file_to_read.xml')
root = tree.getroot()

# Create a fresh root element; only the qualified tag name is supplied here
sub_root = etree.Element('{http://some_site.com#}Data')
# Move every child of the original na:Data element under the new root
for node in root.find('.//na:Data', namespaces={'na': 'http://some_site.com#'}).getchildren():
    sub_root.append(node)

new_tree = etree.ElementTree(sub_root)

I just need the new_tree object so that I can keep working with it. However, if I print new_tree using tostring() [i.e. print etree.tostring(new_tree, pretty_print=True)], this is the output I get:

<ns0:Data xmlns:ns0="http://some_site.com#"><b>
 <c>some_c_attrib</c>
 <d>some_d_attrib</d>
 <e>some_e_attrib</e>
  <f>some_f_attrib</f>
  <g>some_g_attrib</g>
 </b>
 <h>
  <i>some_i_attrib</i>
  <j>some_j_attrib</j>
 </h>
</ns0:Data>

As you can see, na:Data was replaced by ns0:Data, and its attributes (Ref="http://another_site.com" Key="value") are missing. I need a way to extract a portion of the XML exactly as it is, with the original prefix and all of its attributes and values.


Solution

  • There is no need to create new elements. The na:Data element in the parsed tree already carries its namespace prefix and attributes, so just parse the original XML file, find the na:Data element, and write it to a new file.

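    from lxml import etree
    
    tree = etree.parse('file_location/file_to_read.xml')
    # Look up the existing element; its prefix and attributes stay attached
    data = tree.find('.//na:Data', namespaces={'na': 'http://some_site.com#'})
    # Wrap the element in its own tree and write it in canonical (C14N) form
    etree.ElementTree(data).write_c14n("new.xml")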
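
To check the result without touching the filesystem, the same approach can be run against an in-memory document. The following is only a minimal sketch: the shortened sample XML, the helper name extract_c14n, and the use of BytesIO as the output target are assumptions made for illustration and are not part of the original question.

```python
from io import BytesIO

from lxml import etree

# Shortened, well-formed stand-in for the document in the question (assumption).
SAMPLE = b"""<a>
 <na:Data xmlns:na="http://some_site.com#" Ref="http://another_site.com" Key="value">
  <b><c>some_c_attrib</c></b>
  <h><i>some_i_attrib</i></h>
 </na:Data>
 <da:Newtag xmlns:da="http://new_site.com"><k name="http://new_new_site.com"/></da:Newtag>
</a>"""

NS = {"na": "http://some_site.com#"}


def extract_c14n(xml_bytes, path=".//na:Data", namespaces=NS):
    """Return the canonical (C14N) bytes of the first element matching path.

    Hypothetical helper: it reuses the existing element rather than
    building a new one, so the prefix and attributes are preserved.
    """
    root = etree.fromstring(xml_bytes)
    element = root.find(path, namespaces=namespaces)
    if element is None:
        raise ValueError("no element matches %r" % path)
    buf = BytesIO()
    # write_c14n accepts a file name or a writable file-like object
    etree.ElementTree(element).write_c14n(buf)
    return buf.getvalue()


canonical = extract_c14n(SAMPLE)
print(canonical.decode("utf-8"))
```

The printed fragment should keep the na: prefix together with the Ref and Key attributes, and the da:Newtag sibling should be absent. Canonicalization may reorder the attributes, so compare canonical output against canonical output rather than against the hand-written source. lxml also documents etree.tostring(element, method="c14n"), which is expected to yield canonical bytes as well if a string is more convenient than a file.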