
Python - XML: Separating siblings per parent


I am struggling to find a proper answer to this, so it would be great if someone could help me solve it. I have a deeply nested XML file which I want to convert into a table. The XML looks like this:

<Motherofall>
 <Parent>
  <Child>
   <val1>XX1</val1>
  <Child2>
   <val2>YY1</val2>
   <val2>YY2</val2>
  <Child2>
   <val2>YY3</val2>
   <val2>YY4</val2>
 </parent>
+<parent>
+<parent>
</Motherofall>

Eventually, what I want as output is a table with a column val1 and a column val2, so that val1 is repeated for each val2 under the same parent.

[Image: the desired output table]

import xml.etree.ElementTree as et

tree = et.parse(last_file)
for node in tree.findall('.//Parent'):
    XX = node.find('.//Child')
    print(XX.text)
for node2 in tree.findall('.//Child2'):
        YY = node2.find('.//val1')
        print(YY.text)

As you may notice, I am fairly new to this, and I could not find a fitting answer.


Solution

  • I started by bringing some order to your input file (e.g. adding the missing closing tags), so that it contains:

    <Motherofall>
        <parent>
            <Child>
                <val1>XX1</val1>
            </Child>
            <Child2>
                <val2>YY1</val2>
                <val2>YY2</val2>
            </Child2>
            <Child2>
                <val2>YY3</val2>
                <val2>YY4</val2>
            </Child2>
        </parent>
        <parent>
            <Child>
                <val1>XX2</val1>
            </Child>
            <Child2>
                <val2>YY1</val2>
                <val2>YY2</val2>
            </Child2>
            <Child2>
                <val2>YY3</val2>
            </Child2>
        </parent>
    </Motherofall>
    

    The first part of the code reads the XML:

    import xml.etree.ElementTree as et
    
    tree = et.parse('Input.xml')
    root = tree.getroot()
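
    To try this without a file on disk, the same tree can be built from a string. This is only a sketch: the inline sample below is a trimmed, hypothetical copy of the cleaned-up input, and the variable name xml_text is an assumption.

```python
import xml.etree.ElementTree as et

# Hypothetical inline sample, trimmed from the cleaned-up input above
xml_text = """
<Motherofall>
    <parent>
        <Child><val1>XX1</val1></Child>
        <Child2><val2>YY1</val2><val2>YY2</val2></Child2>
    </parent>
    <parent>
        <Child><val1>XX2</val1></Child>
        <Child2><val2>YY3</val2></Child2>
    </parent>
</Motherofall>
"""

# fromstring returns the root Element directly, so no getroot() call is needed
root = et.fromstring(xml_text)
print(root.tag)
print(len(root.findall('parent')))
```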
    

    Then, to read the data from it and build a pandas DataFrame, you can run:

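    import pandas as pd

    rows = []
    for par in root.iter('parent'):
        # val1 belongs to the parent, so read it once per parent
        xx = par.findtext('Child/val1')
        # emit one row for every val2 under any Child2 of this parent
        for vv in par.findall('Child2/val2'):
            rows.append([xx, vv.text])
    df = pd.DataFrame(rows, columns=['x', 'y'])
    print(df)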
    

    The result is:

         x    y
    0  XX1  YY1
    1  XX1  YY2
    2  XX1  YY3
    3  XX1  YY4
    4  XX2  YY1
    5  XX2  YY2
    6  XX2  YY3
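
If pandas is not available, the same pairing can be done with the standard library alone. The following is a minimal sketch, not part of the answer above: the helper name rows_per_parent, the inline XML_TEXT sample, and the val1/val2 column names (taken from the question) are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as et

# Inline copy of the cleaned-up input, so the sketch runs without a file
XML_TEXT = """
<Motherofall>
    <parent>
        <Child><val1>XX1</val1></Child>
        <Child2><val2>YY1</val2><val2>YY2</val2></Child2>
        <Child2><val2>YY3</val2><val2>YY4</val2></Child2>
    </parent>
    <parent>
        <Child><val1>XX2</val1></Child>
        <Child2><val2>YY1</val2><val2>YY2</val2></Child2>
        <Child2><val2>YY3</val2></Child2>
    </parent>
</Motherofall>
"""


def rows_per_parent(root):
    """Yield one (val1, val2) pair for every val2 found under each parent."""
    for par in root.iter('parent'):
        # findtext returns None if this parent has no Child/val1
        val1 = par.findtext('Child/val1')
        for v in par.findall('Child2/val2'):
            yield (val1, v.text)


root = et.fromstring(XML_TEXT)
rows = list(rows_per_parent(root))

# Write the table as CSV text; a real script could write to a file instead
buf = io.StringIO()
writer = csv.writer(buf, lineterminator='\n')
writer.writerow(['val1', 'val2'])
writer.writerows(rows)
print(buf.getvalue())
```

Using relative paths such as 'Child/val1' on each parent element, rather than './/' searches on the whole tree, is what keeps every val2 tied to the val1 of its own parent.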