I am attempting to convert this xml to a dataframe in R:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml
library(xml2)
library(tidyverse)
fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmllist <- as_list(read_xml(fileurl))
xml_df = tibble::as_tibble(xmllist) %>%
unnest_longer(response)
row_wider = xml_df %>%
unnest_wider(response)
row_df = row_wider %>%
unnest(cols = names(.)) %>%
unnest(cols = names(.)) %>%
readr::type_convert()
The issue is that the 'location_1' column is a dictionary and shows up as NA when I unnest. How can I get each of the values of this dictionary into this column? Any help is much appreciated, thanks.
The location_1
column is an empty list (hence you get NA
s) with two attributes human_address
which is a JSON string and a logical needs_recoding
. One option to get your desired result would be to first extract the content of these attributes and store them in a list
. Afterwards you could use two unnest_wider
to unnest the list column.
library(xml2)
library(tidyverse)
parse_location_1 <- function(x) {
x$location_1 <- list(
human_address = jsonlite::fromJSON(attr(x$location_1, "human_address")),
needs_recoding = attr(x$location_1, "needs_recoding")
)
x
}
fileurl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xmllist <- as_list(read_xml(fileurl))
xml_df <- tibble::as_tibble(xmllist) %>%
unnest_longer(response) |>
mutate(response = map(
response, parse_location_1
))
row_wider <- xml_df %>%
unnest_wider(response) |>
unnest_wider(location_1) |>
unnest_wider(human_address)
row_df <- row_wider %>%
unnest(cols = where(is.list)) %>%
unnest(cols = where(is.list)) %>%
readr::type_convert()
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> name = col_character(),
#> zipcode = col_double(),
#> neighborhood = col_character(),
#> councildistrict = col_double(),
#> policedistrict = col_character(),
#> address = col_character(),
#> city = col_character(),
#> state = col_character(),
#> zip = col_logical(),
#> needs_recoding = col_logical(),
#> response_id = col_character()
#> )
head(row_df)
#> # A tibble: 6 × 11
#> name zipcode neighborhood councildistrict policedistrict address city state
#> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 410 21206 Frankford 2 NORTHEASTERN 4509 B… Balt… MD
#> 2 1919 21231 Fells Point 1 SOUTHEASTERN 1919 F… Balt… MD
#> 3 SAUTE 21224 Canton 1 SOUTHEASTERN 2844 H… Balt… MD
#> 4 #1 CH… 21211 Hampden 14 NORTHERN 3998 R… Balt… MD
#> 5 #1 ch… 21223 Millhill 9 SOUTHWESTERN 2481 f… Balt… MD
#> 6 19TH … 21218 Clifton Park 14 NORTHEASTERN 2722 H… Balt… MD
#> # ℹ 3 more variables: zip <lgl>, needs_recoding <lgl>, response_id <chr>