Parsing nested XML structure using minidom in Python


I am a Python and XML beginner, and I am having trouble extracting data from the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<martif xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
   <cat>
      <desc type="No">1</desc>
      <desc type="Main">DES1.1</desc>
      <desc type="Sub">DES1.2</desc>
      <lang xml:lang="EN">
         <t>
            <term>T1.1</term>
            <Typ type="TermType">main</Typ>
         </t>
         <t>
            <term>T1.2</term>
            <Typ type="TermType">option</Typ>
         </t>
      </lang>
      <lang xml:lang="FR">
         <t>
            <term>T1.3</term>
            <Typ type="TermType">main</Typ>
         </t>
         <t>
            <term>T1.4</term>
            <Typ type="TermType">option</Typ>
         </t>
      </lang>
   </cat>
   <cat>
      <desc type="No">2</desc>
      <desc type="Main">DES2.1</desc>
      <desc type="Sub">DES2.2</desc>
      <lang xml:lang="EN">
         <t>
            <term>T2.1</term>
            <Typ type="TermType">main</Typ>
         </t>
         <t>
            <term>T2.2</term>
            <Typ type="TermType">option</Typ>
         </t>
      </lang>
      <lang xml:lang="FR">
         <t>
            <term>T2.3</term>
            <Typ type="TermType">main</Typ>
         </t>
         <t>
            <term>T2.4</term>
            <Typ type="TermType">option</Typ>
         </t>
      </lang>
   </cat>
</martif>

The desired result should be:

Type:  Main      Category: DES1.1
Type:  Sub       Category: DES1.2
lang:  EN
Term:  T1.1
TermType: main
Term:  T1.2
TermType: option
lang:  FR
Term:  T1.3
TermType: main
Term:  T1.4
TermType: option

Type:  Main      Category: DES2.1
Type:  Sub       Category: DES2.2
lang:  EN
Term:  T2.1
TermType: main
Term:  T2.2
TermType: option
lang:  FR
Term:  T2.3
TermType: main
Term:  T2.4
TermType: option

I have tried, but I still cannot get the desired result. My problem is how to extract the data in a way that follows the nesting of the XML structure.

Here is my code:

from xml.dom import minidom

doc = minidom.parse("data.xml")
descs = doc.getElementsByTagName("desc")

for desSetElem in descs:
      type = desSetElem.getAttribute("type")
      if type!='No':
        print('Type: ',type,'     Category:',desSetElem.firstChild.nodeValue)
        lang_termSetElem = doc.getElementsByTagName('lang')
        for lang_term in lang_termSetElem:
             # for lang_tig in lang_tigSetElem:
               lang_type=lang_term.getAttribute(('xml:lang'))
               print('lang: ',lang_type)
               print('Term: ',lang_term.getElementsByTagName("term")[0].firstChild.nodeValue)
               print('Term Type:',lang_term.getElementsByTagName("Typ")[0].firstChild.nodeValue)

Here is the result I got:

Type:  Main      Category: DES1.1
lang:  EN
Term:  T1.1
Term Type: main
lang:  FR
Term:  T1.3
Term Type: main
lang:  EN
Term:  T2.1
Term Type: main
lang:  FR
Term:  T2.3
Term Type: main
Type:  Sub      Category: DES1.2
lang:  EN
Term:  T1.1
Term Type: main
lang:  FR
Term:  T1.3
Term Type: main
lang:  EN
Term:  T2.1
Term Type: main
lang:  FR
Term:  T2.3
Term Type: main
Type:  Main      Category: DES2.1
lang:  EN
Term:  T1.1
Term Type: main
lang:  FR
Term:  T1.3
Term Type: main
lang:  EN
Term:  T2.1
Term Type: main
lang:  FR
Term:  T2.3
Term Type: main
Type:  Sub      Category: DES2.2
lang:  EN
Term:  T1.1
Term Type: main
lang:  FR
Term:  T1.3
Term Type: main
lang:  EN
Term:  T2.1
Term Type: main
lang:  FR
Term:  T2.3
Term Type: main

Solution

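  • Your code calls getElementsByTagName('lang') on doc, so every <desc> is followed by every <lang> in the whole file, and the [0] index keeps only the first <term> and <Typ> of each <lang>. Instead, walk down the three levels of the XML with your loops: <cat>, then <desc> and <lang>, then <t>. Because <lang> is a sibling of <desc>, not a child, loop over it after the <desc> loop rather than inside it, and iterate over the <t> elements to reach every term.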

    Also consider using f-strings (Python 3.6+) and breaking long lines to stay within the 79-character limit recommended by PEP 8.

    from xml.dom import minidom

    doc = minidom.parse("data.xml")  # same file name as in the question
    cats = doc.getElementsByTagName("cat")

    for catElem in cats:
        # Level 2: search only inside the current <cat>, not the whole doc.
        descs = catElem.getElementsByTagName("desc")
        for desSetElem in descs:
            # desc_type avoids shadowing the built-in type().
            desc_type = desSetElem.getAttribute("type")
            if desc_type != "No":
                print(f"Type: {desc_type.ljust(9)}"
                      f"Category: {desSetElem.firstChild.nodeValue}")

        # <lang> is a sibling of <desc>, so it gets its own loop.
        lang_termSetElem = catElem.getElementsByTagName("lang")
        for lang_term in lang_termSetElem:
            lang_type = lang_term.getAttribute("xml:lang")
            print(f"lang: {lang_type}")

            # Level 3: visit every <t>, not just the first one.
            lang_tigSetElem = lang_term.getElementsByTagName("t")
            for lang_tig in lang_tigSetElem:
                term = (lang_tig.getElementsByTagName("term")[0]
                                .firstChild
                                .nodeValue)
                term_type = (lang_tig.getElementsByTagName("Typ")[0]
                                     .firstChild
                                     .nodeValue)

                print(f"Term: {term}")
                print(f"Term Type: {term_type}")
    

    Output

    Type: Main     Category: DES1.1
    Type: Sub      Category: DES1.2
    lang: EN
    Term: T1.1
    Term Type: main
    Term: T1.2
    Term Type: option
    lang: FR
    Term: T1.3
    Term Type: main
    Term: T1.4
    Term Type: option
    Type: Main     Category: DES2.1
    Type: Sub      Category: DES2.2
    lang: EN
    Term: T2.1
    Term Type: main
    Term: T2.2
    Term Type: option
    lang: FR
    Term: T2.3
    Term Type: main
    Term: T2.4
    Term Type: option
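
To see why the starting node of getElementsByTagName matters, here is a small sketch that is not part of the original answer. It counts <lang> matches from the document root and from each <cat>. The XML is a shortened copy of the question's data, inlined as a string so the snippet runs on its own; the names XML, all_langs, and per_cat are assumptions made up for this illustration.

```python
from xml.dom import minidom

# Shortened, hypothetical copy of the question's data (not the full file).
XML = """<martif xml:lang="en">
   <cat>
      <desc type="No">1</desc>
      <desc type="Main">DES1.1</desc>
      <lang xml:lang="EN">
         <t><term>T1.1</term><Typ type="TermType">main</Typ></t>
         <t><term>T1.2</term><Typ type="TermType">option</Typ></t>
      </lang>
      <lang xml:lang="FR">
         <t><term>T1.3</term><Typ type="TermType">main</Typ></t>
      </lang>
   </cat>
   <cat>
      <desc type="No">2</desc>
      <desc type="Main">DES2.1</desc>
      <lang xml:lang="EN">
         <t><term>T2.1</term><Typ type="TermType">main</Typ></t>
      </lang>
   </cat>
</martif>"""

doc = minidom.parseString(XML)

# Called on the document, the search covers every <cat> at once.
all_langs = doc.getElementsByTagName("lang")

# Called on a single <cat>, the search covers only that element's descendants.
per_cat = [
    [lang.getAttribute("xml:lang") for lang in cat.getElementsByTagName("lang")]
    for cat in doc.getElementsByTagName("cat")
]

print(len(all_langs))  # 3
print(per_cat)         # [['EN', 'FR'], ['EN']]
```

The document-level call mixes the languages of both categories together, which is what produced the repeated blocks in the question's output. The per-<cat> call keeps each category's languages separate.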