I have a Python program opening up Excel (XLSX) files, and trying to find the <connection>
node.
This is the full XML from the connections.xml
file.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="xr16"
xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
<connection
id="1" xr16:uid="{#####}" keepAlive="1"
name="Query - CargoData_small"
description="Connection to the 'CargoData_small' query in the workbook."
type="5" refreshedVersion="7" background="1">
<dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties="""
command="SELECT * FROM [CargoData_small]"/>
</connection>
</connections>
I am trying to get to find the <dbPr>
node. but I am stuck on the child node of the code as shown below:
def checkfile(filename):
if zipfile.is_zipfile(filename):
zf = zipfile.ZipFile(filename, 'r')
if "xl/connections.xml" in zf.namelist():
print(filename)
xml = zf.read('xl/connections.xml')
root = parseString(xml)
connections = root.getElementsByTagName('connection')
try:
for con in connections:
for child in con.childNodes:
# there are no 'children'
for children in child.childNodes:
dsn = children.attributes.values()[0].nodeValue
sql = children.attributes.values()[1].nodeValue
writeoutput(filename, dsn, sql )
except:
pass
So what happens is I get the 'child' value, but I cannot find the dbPr section.
This is what I am getting as an error:
I am using Pycharm as the IDE.
Thanks
You're missing the namespaces. The xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
in the root node means it and its descendants that don't have another namespace attached have that namespace, and as such you'll need to qualify your queries with that namespace.
A minimal example using the built-in xml.etree.ElementTree
, and the data hard-coded for convenience...
data = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<connections
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
mc:Ignorable="xr16"
xmlns:xr16="http://schemas.microsoft.com/office/spreadsheetml/2017/revision16">
<connection
id="1" xr16:uid="{#####}" keepAlive="1"
name="Query - CargoData_small"
description="Connection to the 'CargoData_small' query in the workbook."
type="5" refreshedVersion="7" background="1">
<dbPr connection="Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties="""
command="SELECT * FROM [CargoData_small]"/>
</connection>
</connections>
"""
from xml.etree import ElementTree as ET
root = ET.fromstring(data)
for connection in root.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}connection'):
for dbpr in connection.findall('.//{http://schemas.openxmlformats.org/spreadsheetml/2006/main}dbPr'):
print(connection.attrib['name'], dbpr.attrib['connection'], dbpr.attrib['command'])
This prints out
Query - CargoData_small Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=CargoData_small;Extended Properties="" SELECT * FROM [CargoData_small]