
Extracting comments from XML file in Python


I would like to extract the comment section of an XML file. The information I need is stored inside the Tag element, within its embedded Text element; in the sample below it is "EXAMPLE".

The structure of the XML file is shown below.

<Boxes>

  <Box Id="3" ZIndex="13">
      <Shape>Rectangle</Shape>
      <Brush Id="0" />
      <Pen>
        <Color>#FF000000</Color>

      </Pen>
      <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;

&lt;/PFDComment&gt;</Tag>
  </Box>

</Boxes>

I tried the following but couldn't get the information I want.

def read_comments(xml):
    tree = lxml.etree.parse(xml)

    Comments= {}
    for comment in tree.xpath("//Boxes/Box"):
    #                                
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
        #                        
            Comments[get_id].append(group.text)

    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))  

Can anyone help to extract comments from XML file shown above? Any help is appreciated!


Solution

  • Use the code given below:

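    def read_comments(xml):
        tree = etree.parse(xml)
        rows = []
        # A relative XPath on the tree is evaluated against the root element (Boxes).
        for box in tree.xpath('Box'):
            box_id = box.attrib['Id']  # avoid shadowing the built-in id()
            # The Tag content is returned with &lt; / &gt; already unescaped.
            tagTxt = box.findtext('Tag')
            if tagTxt is None:
                continue  # this Box has no Tag child
            # Parse the embedded document and look up its Text element.
            txtNode = etree.XML(tagTxt).find('Text')
            if txtNode is None:
                continue  # the embedded document has no Text child
            # An empty Text element has text None, so fall back to ''.
            rows.append([box_id, (txtNode.text or '').strip()])
        return pd.DataFrame(rows, columns=['id', 'Comment'])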
    

    Note that a DataFrame created inside a function is a local variable of that function and is not visible from outside. A cleaner and more readable approach (the one used here) is to have the function return the DataFrame.

    The function also uses continue in two places to guard against error cases: a Box element that has no Tag child, or a Tag whose content has no Text child element.
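
    The two guards can be exercised on a small in-memory document. The sketch below is only an illustration, not the code above: it assumes the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages, it returns a plain list instead of a DataFrame, and the helper name extract_rows and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: Box 1 has no Tag, Box 2 has a Tag without Text,
# Box 3 is well formed.
SAMPLE = """<Boxes>
  <Box Id="1"><Shape>Rectangle</Shape></Box>
  <Box Id="2"><Tag>&lt;PFDComment&gt;&lt;Other&gt;x&lt;/Other&gt;&lt;/PFDComment&gt;</Tag></Box>
  <Box Id="3"><Tag>&lt;PFDComment&gt;&lt;Text&gt;EXAMPLE &lt;/Text&gt;&lt;/PFDComment&gt;</Tag></Box>
</Boxes>"""


def extract_rows(xml_text):
    """Return [box_id, comment] pairs, skipping boxes without a usable comment."""
    root = ET.fromstring(xml_text)
    rows = []
    for box in root.findall('Box'):
        tag_text = box.findtext('Tag')
        if tag_text is None:
            continue  # first guard: no Tag child
        text_node = ET.fromstring(tag_text).find('Text')
        if text_node is None:
            continue  # second guard: no Text child in the embedded document
        rows.append([box.get('Id'), (text_node.text or '').strip()])
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

    Only the third box survives both guards, so the first two are skipped silently rather than raising an exception.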

    There is also no need to replace &lt; or &gt; with < or > in your own code: lxml unescapes them itself when it returns the text of the Tag element.
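
    The unescaping can be checked in isolation. This is a minimal sketch that assumes the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra packages; entity unescaping is standard XML parser behavior, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The outer document stores the inner XML as escaped character data.
outer = ET.fromstring(
    '<Tag>&lt;PFDComment&gt;&lt;Text&gt;EXAMPLE&lt;/Text&gt;&lt;/PFDComment&gt;</Tag>'
)

# The parser has already turned &lt; and &gt; back into < and >.
inner_source = outer.text
print(inner_source)

# The unescaped string is itself well-formed XML and can be parsed again.
inner = ET.fromstring(inner_source)
comment = inner.findtext('Text')
print(comment)
```

    This is why the solution parses twice: once for the file, and once for the string held in Tag.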

    Edit

    My test was as follows. Start with the imports:

    import pandas as pd
    from lxml import etree
    

    I used a file containing:

    <Boxes>
      <Box Id="3" ZIndex="13">
        <Shape>Rectangle</Shape>
        <Brush Id="0" />
        <Pen>
          <Color>#FF000000</Color>
        </Pen>
        <Tag>&lt;?xml version="1.0"?&gt;
    &lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
      &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
    &lt;/PFDComment&gt;</Tag>
      </Box>
    </Boxes>
    

    I called the above function:

    df_name1 = read_comments('Boxes.xml')
    

    and when I printed df_name1, I got:

      id      Comment
    0  3  **EXAMPLE**
    

    If something goes wrong, use this extended version of the function, which adds test printouts:

    def read_comments(xml):
        tree = etree.parse(xml)
        rows = []
        for box in tree.xpath('Box'):
            box_id = box.attrib['Id']  # avoid shadowing the built-in id()
            tagTxt = box.findtext('Tag')
            if tagTxt is None:
                print('No Tag element')
                continue
            txtNode = etree.XML(tagTxt).find('Text')
            if txtNode is None:
                print('No Text element')
                continue
            # An empty Text element has text None, so fall back to ''.
            txt = (txtNode.text or '').strip()
            print(f'{box_id}: {txt}')
            rows.append([box_id, txt])
        return pd.DataFrame(rows, columns=['id', 'Comment'])
    

    and check the printouts to see which step is being skipped.