I'm trying to extract data from an XML file. I'm extracting the nodes separately with the following code:
entity_uin <- xml_text(xml_find_all(xml, ".//Entity/EntityUin"))
entity_name <- xml_text(xml_find_all(xml, ".//Entity/EntityName"))
entity_zip_code <- xml_text(xml_find_all(xml, ".//Entity/EntityZipCode"))
This way I'm getting three character vectors. Then, I'm trying to create a tibble from these character vectors with the following code:
xml <- tibble(entity_uin, entity_name, entity_zip_code)
Unfortunately, this doesn't work because the three character vectors have unequal lengths. Can anyone suggest a solution?
Assuming(!) that some Entity nodes in your document are incomplete, and that the error is raised because some of your column vectors are shorter than others, you could first get the set of parent nodes and then extract the details from each of them with xml_find_first(). The output of xml_find_first() is always the same length as its input, and missing matches are filled with NA, so the resulting vectors are aligned and can be passed to tibble():
library(xml2)
example_xml <-
  '<?xml version="1.0" encoding="UTF-8"?>
<Entities>
  <Entity>
    <EntityUin>123456</EntityUin>
    <EntityName>ABC Corp</EntityName>
    <EntityZipCode>10001</EntityZipCode>
  </Entity>
  <Entity>
    <EntityUin>789012</EntityUin>
    <EntityName>XYZ Inc</EntityName>
    <!-- Missing EntityZipCode -->
  </Entity>
  <Entity>
    <EntityUin>345678</EntityUin>
    <EntityName>Sample LLC</EntityName>
    <EntityZipCode>90210</EntityZipCode>
  </Entity>
</Entities>'
entities <-
  read_xml(example_xml) |>
  xml_find_all("/Entities/Entity")
tibble::tibble(
  entity_uin = xml_find_first(entities, "./EntityUin") |> xml_text(),
  entity_name = xml_find_first(entities, "./EntityName") |> xml_text(),
  entity_zip_code = xml_find_first(entities, "./EntityZipCode") |> xml_text()
)
#> # A tibble: 3 × 3
#>   entity_uin entity_name entity_zip_code
#>   <chr>      <chr>       <chr>
#> 1 123456     ABC Corp    10001
#> 2 789012     XYZ Inc     <NA>
#> 3 345678     Sample LLC  90210
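
If there are more than a handful of child elements, the same idea can be wrapped in a small helper so that each field is not spelled out by hand. This is only a sketch: extract_fields() is a hypothetical name, not part of xml2, and it assumes that every field is a direct child of the parent node and occurs at most once per parent. The inline document is made up for illustration.

```r
library(xml2)

# Hypothetical helper (not part of xml2): builds one column per child
# element name and one row per parent node.
extract_fields <- function(nodes, fields) {
  cols <- lapply(fields, function(field) {
    # xml_find_first() returns one result per parent node, so every
    # column has length(nodes) entries; missing children become NA.
    xml_text(xml_find_first(nodes, paste0("./", field)))
  })
  names(cols) <- fields
  tibble::as_tibble(cols)
}

# Made-up two-entity document; the second entity has no EntityZipCode.
doc <- read_xml(
  "<Entities>
     <Entity>
       <EntityUin>1</EntityUin>
       <EntityName>A</EntityName>
       <EntityZipCode>10001</EntityZipCode>
     </Entity>
     <Entity>
       <EntityUin>2</EntityUin>
       <EntityName>B</EntityName>
     </Entity>
   </Entities>"
)

entities <- xml_find_all(doc, "/Entities/Entity")
result <- extract_fields(entities, c("EntityUin", "EntityName", "EntityZipCode"))
result
```

The column names here are the element names themselves; rename them afterwards if you prefer snake_case.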