pythonexcelxml

Need help to extract this XML node - Excel connection strings in Python


I have a Python program opening up Excel (XLSX) files, and trying to find the <connection> node.

This is the full XML from the connections.xml file.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections 
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
    mc:Ignorable="xr16" 
    xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
    <connection 
        id="1" xr16:uid="{#####}" keepAlive="1" 
        name="Query - CargoData_small" 
        description="Connection to the 'CargoData_small' query in the workbook." 
        type="5" refreshedVersion="7" background="1">
        <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
            command="SELECT * FROM [CargoData_small]"/>
    </connection>
</connections>

I am trying to get to find the <dbPr> node. but I am stuck on the child node of the code as shown below:

def checkfile(filename):
    if zipfile.is_zipfile(filename):
        zf = zipfile.ZipFile(filename, 'r')
        if "xl/connections.xml" in zf.namelist():
            print(filename)
            xml = zf.read('xl/connections.xml')
            root = parseString(xml)
            connections = root.getElementsByTagName('connection')
            try:
                for con in connections:
                    for child in con.childNodes:

                        # there are no 'children'
                        for children in child.childNodes:
                            dsn = children.attributes.values()[0].nodeValue
                            sql = children.attributes.values()[1].nodeValue
                            writeoutput(filename, dsn, sql )
            except:
                pass

So what happens is I get the 'child' value, but I cannot find the dbPr section.

This is what I am getting as an error:

{TypeError}TypeError("'dict_values' object is not subscriptable")

I am using Pycharm as the IDE.

Thanks


Solution

  • You're missing the namespaces. The xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" in the root node means it and its descendants that don't have another namespace attached have that namespace, and as such you'll need to qualify your queries with that namespace.

    A minimal example using the built-in xml.etree.ElementTree, and the data hard-coded for convenience...

    data = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <connections 
        xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" 
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
        mc:Ignorable="xr16" 
        xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
        <connection 
            id="1" xr16:uid="{#####}" keepAlive="1" 
            name="Query - CargoData_small" 
            description="Connection to the 'CargoData_small' query in the workbook." 
            type="5" refreshedVersion="7" background="1">
            <dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties=&quot;&quot;" 
                command="SELECT * FROM [CargoData_small]"/>
        </connection>
    </connections>
    """
    
    from xml.etree import ElementTree as ET
    
    root = ET.fromstring(data)
    for connection in root.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}connection'):
        for dbpr in connection.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}dbPr'):
            print(connection.attrib['name'], dbpr.attrib['connection'], dbpr.attrib['command'])
    

    This prints out

    Query - CargoData_small Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties="" SELECT * FROM [CargoData_small]