python-3.x xml minidom

Why does extraneous text show up when I print nodeName?


Suppose I have the following XML file:

<?xml version="1.0" encoding="utf-8"?>
<library attrib1="att11" attrib2="att22">
    library-text
    <book isbn="1111111111">
        <title lang="en">T1 T1 T1 T1 T1</title>
        <date>2001</date>
        <author>A1 A1 A1 A1 A1</author>     
        <price>10.00</price>
    </book>
    <book isbn="2222222222">
        <title lang="en">T2 T2 T2 T2 T2</title>
        <date>2002</date>
        <author>A2 A2 A2 A2 A2</author>     
        <price>20.00</price>
    </book>
    <book isbn="3333333333">
        <title lang="en">T3 T3 T3 T3</title>
        <date>2003</date>
        <author>A3 A3 A3 A3 A3y</author>        
        <price>30.00</price>
    </book>
</library>

main.py

import xml.dom.minidom as minidom

xml_fname = "library.xml"

dom = minidom.parse(xml_fname) 

for node in dom.firstChild.childNodes:
    print(node.nodeName)

output

#text
book
#text
book
#text
book
#text

Why does the output show #text? Where is it coming from?


Solution

  • If you change print(node.nodeName) to print(node), you will see this output:

    <DOM Text node "'\n    libra'...">
    <DOM Element: book at 0x11f48ec8>
    <DOM Text node "'\n    '">
    <DOM Element: book at 0x11f50070>
    <DOM Text node "'\n    '">
    <DOM Element: book at 0x11f501d8>
    <DOM Text node "'\n'">
    

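    minidom represents any character data between elements, including the whitespace and newlines used for indentation, as real DOM Text nodes. Text nodes have no tag name of their own, so their nodeName is always the fixed string #text. That is why every book element in the output is surrounded by #text entries.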

    If you only want the book nodes, be explicit about it:

    for node in dom.getElementsByTagName('book'):
        print(node.nodeName)
    

    outputs

    book
    book
    book
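
Note that getElementsByTagName searches all descendants, not only direct children. If you want to keep iterating over childNodes, a minimal sketch is to filter on nodeType instead. The element_children helper and the shortened inline XML below are assumptions made for illustration, not part of the original question:

```python
import xml.dom.minidom as minidom

# Shortened inline copy of library.xml so the sketch runs on its own.
XML = """<library attrib1="att11" attrib2="att22">
    library-text
    <book isbn="1111111111"><title lang="en">T1 T1 T1 T1 T1</title></book>
    <book isbn="2222222222"><title lang="en">T2 T2 T2 T2 T2</title></book>
</library>
"""

dom = minidom.parseString(XML)


def element_children(parent):
    """Hypothetical helper: return only the direct children that are elements."""
    return [
        node
        for node in parent.childNodes
        if node.nodeType == minidom.Node.ELEMENT_NODE
    ]


# Text nodes (nodeName "#text") are skipped; only the book elements remain.
names = [node.nodeName for node in element_children(dom.documentElement)]
print(names)  # ['book', 'book']
```

Filtering on nodeType keeps the traversal limited to direct children, which matters once book elements can appear at more than one nesting level.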
    

    Keep in mind that using minidom is not encouraged. From the official Python docs:

    Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.

    Consider using ElementTree:

    import xml.etree.ElementTree as ET
    
    xml_fname = "library.xml"
    
    # ET.parse() returns an ElementTree; getroot() gives the <library> element
    root = ET.parse(xml_fname).getroot()
    
    for node in root.findall('book'):
        print(node.tag)
    

    also outputs

    book
    book
    book
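
The reason no #text entries appear here is that ElementTree does not model character data as separate nodes. It stores it in the text and tail attributes of the elements. The following is a rough sketch of that behaviour, using a shortened inline copy of the XML (an assumption so the example is self-contained):

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of library.xml so the sketch runs on its own.
XML = """<library attrib1="att11" attrib2="att22">
    library-text
    <book isbn="1111111111"><title lang="en">T1 T1 T1 T1 T1</title></book>
    <book isbn="2222222222"><title lang="en">T2 T2 T2 T2 T2</title></book>
</library>"""

root = ET.fromstring(XML)

# Iterating over an Element yields only child elements, never text.
tags = [child.tag for child in root]
print(tags)  # ['book', 'book']

# Text before the first child lives in root.text ...
print(repr(root.text.strip()))  # 'library-text'

# ... and text following a child (here only indentation) lives in its tail.
print(repr(root[0].tail))
```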