python xml nested iterparse

How to apply ElementTree iterparse to nested XML


I am trying to replicate the example from this tutorial, but using iterparse with elem.clear().

XML example:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
    <world>
        <region name="USA">
            <AgSupplySector name="Corn" nocreate="1">
                <AgSupplySubsector name="Corn_NelsonR" nocreate="1">
                    <AgProductionTechnology name="Corn_NelsonR" nocreate="1">
                        <period year="1975">
                            <Non-CO2 name="SO2_1_AWB">
                                <input-emissions>3.98749e-05</input-emissions>
                                <output-driver/>
                                <gdp-control name="GDP_control">
                                    <max-reduction>60</max-reduction>
                                    <steepness>3.5</steepness>
                                </gdp-control>
                            </Non-CO2>
                            <Non-CO2 name="NOx_AWB">
                                <input-emissions>0.000285263</input-emissions>
                                <output-driver/>
                                <gdp-control name="GDP_control">
                                    <max-reduction>60</max-reduction>
                                    <steepness>3.5</steepness>
                                </gdp-control>
                            </Non-CO2>
                        </period>
                    </AgProductionTechnology>
                </AgSupplySubsector>
            </AgSupplySector>
        </region>
    </world>
</scenario>                         

The expected output is a table with the columns Year, Gas, Value, Technology, Crop and Country. I am trying to parse the file with the following code:

import os
import xml.etree.cElementTree as etree
import codecs
import csv

PATH = r'D:\Book1'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'
ENCODING = "utf-8"


pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)

with codecs.open(pathBIO, "w", ENCODING) as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow(['Year','Gas', 'Value','Technology','Crop','Country'])

    for event, elem in etree.iterparse(pathXML, events=('start','end')):
        if event == 'start' and elem.tag == 'region':
            str1 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'period':
            str4 = elem.attrib['year']
        elif event == 'start' and elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif event == 'end' and elem.tag == 'input-emissions':
            for em in elem.iter('input-emissions'):
                str6 = em.text
                bioWriter.writerow([str4, str5, str6, str3, str2, str1])
            
            elem.clear()

My problem is that the output contains extra lines with empty fields for str6. I probably have a nesting problem here. Please help. An example of the erroneous output (0 fields appear) was attached as an image.


Solution

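  • The for em in elem.iter('input-emissions') loop is redundant: elem already is the input-emissions element, and elem.iter() yields that element itself, so the loop can be dropped.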

    import os
    import xml.etree.ElementTree as etree
    import csv
    
    PATH = '.'
    FILENAME_BIO = 'Test.csv'
    FILENAME_XML = 'all_aglu_emissions.xml'
    
    
    pathBIO = os.path.join(PATH, FILENAME_BIO)
    pathXML = os.path.join(PATH, FILENAME_XML)
    
    with open(pathBIO, 'w', encoding='utf8', newline='') as bioFH:
        bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
        bioWriter.writerow('Year Gas Value Technology Crop Country'.split())
    
        for event, elem in etree.iterparse(pathXML, events=('start', 'end')):
            if event == 'start':
                # Attributes are already available on the start event.
                if elem.tag == 'region':
                    str1 = elem.attrib['name']
                elif elem.tag == 'AgSupplySector':
                    str2 = elem.attrib['name']
                elif elem.tag == 'AgProductionTechnology':
                    str3 = elem.attrib['name']
                elif elem.tag == 'period':
                    str4 = elem.attrib['year']
                elif elem.tag == 'Non-CO2':
                    str5 = elem.attrib['name']
            else:
                if elem.tag == 'input-emissions':
                    # The text is only guaranteed to be complete on the end event.
                    str6 = elem.text
                    bioWriter.writerow([str4, str5, str6, str3, str2, str1])
                # The element is finished, so its memory can be released.
                elem.clear()
    

    There are some other subtle changes I made to the code, since I assume you're using Python 3 for this. They include using xml.etree.ElementTree instead of the deprecated xml.etree.cElementTree (removed in Python 3.9), skipping the codecs module (Python 3's open() handles encodings natively) and passing the newline='' parameter to the open() call, so the csv module can handle newlines correctly by itself.

    The attributes are read on the start event, where they are already available. The text of input-emissions is read on the end event, because iterparse() does not guarantee that elem.text is populated on start; for the same reason, elem.clear() is called on end, once the element is complete.

    The result is

    Year,Gas,Value,Technology,Crop,Country
    1975,SO2_1_AWB,3.98749e-05,Corn_NelsonR,Corn,USA
    1975,NOx_AWB,0.000285263,Corn_NelsonR,Corn,USA
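
For quick experimentation, the same idea can be tried without touching the file system. The following is only a minimal sketch under a few assumptions: the names CONTEXT_ATTRS, SAMPLE_XML and extract_rows are hypothetical and not part of the original code, the XML is a trimmed copy of the sample from the question, and a dict keyed by tag name stands in for the str1 to str5 variables.

```python
import csv
import io
import xml.etree.ElementTree as etree

# Hypothetical mapping: tag name -> attribute that identifies the enclosing element.
CONTEXT_ATTRS = {
    'region': 'name',
    'AgSupplySector': 'name',
    'AgProductionTechnology': 'name',
    'period': 'year',
    'Non-CO2': 'name',
}

# Trimmed copy of the sample document from the question.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<scenario><world><region name="USA">
  <AgSupplySector name="Corn">
    <AgSupplySubsector name="Corn_NelsonR">
      <AgProductionTechnology name="Corn_NelsonR">
        <period year="1975">
          <Non-CO2 name="SO2_1_AWB">
            <input-emissions>3.98749e-05</input-emissions>
          </Non-CO2>
          <Non-CO2 name="NOx_AWB">
            <input-emissions>0.000285263</input-emissions>
          </Non-CO2>
        </period>
      </AgProductionTechnology>
    </AgSupplySubsector>
  </AgSupplySector>
</region></world></scenario>
"""


def extract_rows(source):
    """Yield one CSV row per <input-emissions> element found in source."""
    context = {}
    for event, elem in etree.iterparse(source, events=('start', 'end')):
        if event == 'start':
            attr = CONTEXT_ATTRS.get(elem.tag)
            if attr is not None:
                # Attributes are available as soon as the start tag is seen.
                context[elem.tag] = elem.get(attr)
        else:
            if elem.tag == 'input-emissions':
                # The text is only guaranteed to be complete on the end event.
                yield [context.get('period'), context.get('Non-CO2'), elem.text,
                       context.get('AgProductionTechnology'),
                       context.get('AgSupplySector'), context.get('region')]
            elem.clear()  # release the finished element


out = io.StringIO(newline='')
writer = csv.writer(out)
writer.writerow(['Year', 'Gas', 'Value', 'Technology', 'Crop', 'Country'])
writer.writerows(extract_rows(io.BytesIO(SAMPLE_XML)))
print(out.getvalue())
```

Because extract_rows is a generator, rows are written as the parser advances, so the same function should also work when it is handed the path of a large file instead of the in-memory buffer.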