rexcelxmlxmlnode

R - Get XML nodes


I have a question about XML nodes.

I want to replace both of values and formulas in Excel file, so first, I changed it to zip file using file.rename(), after, I work in XML file of zip file. You can find XML in /xl/worksheets/sheet1.xml, to replace, I write code lines:

library(XML)

### Change excel to zip
data_path <- "Untitled 1.xlsx"

file.rename(data_path, "Untitled 1.zip")

dir.create("Untitled 1")
utils::unzip("Untitled 1.zip", exdir = "Untitled 1")

### Read XML file
doc <- xmlTreeParse("Untitled 1/xl/worksheets/sheet1.xml", useInternal = TRUE)

#### Select the nodes we want to update
nodes <- getNodeSet(doc, "/worksheet")
nodes <- getNodeSet(doc, "/worksheet/sheetData")

# > list()
# > list()

When I get nodes, all of them return blank list nodes. It worked with getNodeSet(doc, "/"), Xpath expression is /, it return list having length = 1. I don't understand why that? What wrong is XML file? I have just tested in https://www.freeformatter.com/xpath-tester.html, it still return right result if Xpath expression is /worksheet or /worksheet/sheetData

Can you explain the problem is? Thank you.

I attached the excel file, output zip file and XML file in here


Solution

  • Your document has a default namespace given in the root "worksheet" node. The XML library needs the default namespaces to be named. You can do something like this

    # name the namespaces
    ns <- c(ns="http://schemas.openxmlformats.org/spreadsheetml/2006/main",
            r="http://schemas.openxmlformats.org/officeDocument/2006/relationships")
    
    # use the ns= namespace in the xpath
    getNodeSet(doc, "/ns:worksheet", namespaces = ns)
    getNodeSet(doc, "/ns:worksheet/ns:sheetData", namespaces = ns)
    

    I know with xml2, it's possible to strip the default namespace if you don't want to bother with it. You could do

    library(xml2)
    doc <- xml_ns_strip(read_xml("sheet1.xml"))
    xml_find_all(doc, "/worksheet")
    xml_find_all(doc, "/worksheet/sheetData")