rtidyr

Separate multiple columns of nested data and clean them at the same time using base R and Tidyverse


Question: Is there a way for me to separate multiple nested columns and clean them all at the same time using base R and tidyverse? I've checked the other nested responses here, but nothing seems to fit.

My dataframe has nested data that follows the following format.

v1 is type = double, v2 is type = character

my_df
    v1      v2
1   1234    {"484391115444":192,"484391115442":61}
2   1235    {"484391115432":52,"484391115443":24}
3   1236    {"484391115302":11,"484391115161":10}

I am able to separate (up to 10) using tidyr::separate as:

my_df <- tidyr::separate(my_df, col=v2, into=c("v2_01", "v2_02"), sep = ",")

In the actual data, v2 is of varying length, meaning amount the nested data is different for each row. This is not an issue as I really only need the first 10 values of v2 stored in the column, after I separate, the rest is truncated, along with the ending } character. So far so good.

This produces:

    v1      v2_01               v2_02
1   1234    {"484391115444":192 "484391115442":61
2   1235    {"484391115432":52  "484391115443":24
3   1236    {"484391115302":11  "484391115161":10

Now I need to clean it (remove the { and " characters) and separate again on the colon. I can do this as individual commands:

my_df$v2_01 <- gsub("\\{","", as.character(my_df$v2_01))
my_df$v2_01 <- gsub('"',"", as.character(my_df$v2_01))

#separating on the ":" character

my_df <- tidyr::separate(my_df, col=v2_01, into=c("v2_01", "v2_01_Num" ,sep = ":"))

This produces:

    v1      v2_01           v2_01_Num   v2_02           v2_02_Num   :
1   1234    484391115444    92          484391115442    61          :
2   1235    484391115432    52          484391115443    24          :
3   1236    484391115302    11          484391115161    10          :

So far, meh, I do get the ":" column now. This is functional at best.

My question is the following: Is there a way for me to separate multiple columns and clean them all at the same time? I'm also getting the ":" column now too as output. I'm a little unfamiliar with looping in R, but going through and separating 10 variables out column by column manually seems horribly tedious and inefficient. The above code does work, but I have a other dataframes where the columns with nested data are in the dozens.

maybe tidyverse's separate_rows?

Thank you in advance.


Solution

  • You essentially have JSON data in strings of dat$V2, so you can use a package like jsonlite to import them in a sensible format:

    library(jsonlite)
    si <- stream_in(textConnection(dat$v2), simplifyVector=FALSE)
    dat[paste("v2", seq_along(si[[1]]), sep="_")] <- t(sapply(si, names))
    dat[paste("v2", seq_along(si[[1]]), "num", sep="_")] <- do.call(Map, c(c, si))
    dat
    
    ##    v1                                     v2         v2_1         v2_2 v2_1_num v2_2_num
    ##1 1234 {"484391115444":192,"484391115442":61} 484391115444 484391115442      192       61
    ##2 1235  {"484391115432":52,"484391115443":24} 484391115432 484391115443       52       24
    ##3 1236  {"484391115302":11,"484391115161":10} 484391115302 484391115161       11       10