rxmlxpathxml2

Turning xml file into dataframe


I'm trying to extract data from an xml file. I'm extracting the nodes separately with the following code:

entity_uin <- xml_text(xml_find_all(xml, ".//Entity/EntityUin"))
entity_name <- xml_text(xml_find_all(xml, ".//Entity/EntityName"))
entity_zip_code <- xml_text(xml_find_all(xml, ".//Entity/EntityZipCode"))

This way I'm getting three character vectors. Then, I'm trying to create a tibble from these character vectors with the following code:

xml <- tibble(entity_uin, entity_name, entity_zip_code)

Unfortunately, this doesn't work because the three character vectors are with unequal lengths. Can anyone suggest a solution?


Solution

  • Assuming(!) that some Entity nodes in your document are not complete and error is raised because some of your column vectors are shorter than others, you could first get a set of parent nodes and extract details from those with xml_find_first(). xml_find_first() output is always the same size as the input, missing matches are filled with NAs, resulting vectors are aligned and can be passed to tibble():

    library(xml2)
    example_xml <- 
    '<?xml version="1.0" encoding="UTF-8"?>
    <Entities>
      <Entity>
        <EntityUin>123456</EntityUin>
        <EntityName>ABC Corp</EntityName>
        <EntityZipCode>10001</EntityZipCode>
      </Entity>
      <Entity>
        <EntityUin>789012</EntityUin>
        <EntityName>XYZ Inc</EntityName>
        <!-- Missing EntityZipCode -->
      </Entity>
      <Entity>
        <EntityUin>345678</EntityUin>
        <EntityName>Sample LLC</EntityName>
        <EntityZipCode>90210</EntityZipCode>
      </Entity>
    </Entities>'
    
    entities <- 
      read_xml(example_xml) |>
      xml_find_all("/Entities/Entity")
    
    tibble::tibble(
      entity_uin      = xml_find_first(entities, "./EntityUin")     |> xml_text(),
      entity_name     = xml_find_first(entities, "./EntityName")    |> xml_text(), 
      entity_zip_code = xml_find_first(entities, "./EntityZipCode") |> xml_text()
    )
    #> # A tibble: 3 × 3
    #>   entity_uin entity_name entity_zip_code
    #>   <chr>      <chr>       <chr>          
    #> 1 123456     ABC Corp    10001          
    #> 2 789012     XYZ Inc     <NA>           
    #> 3 345678     Sample LLC  90210