rdataframeparsingdplyrdelimiter

Separate multiple columns and name column names after part of string R


I've got data which saved all the survey question responses to the same cell. I'm trying to find a way to split the data into multiple columns and use the part of each string until a set delimiter as the column name. The table looks like this at the moment

ID Response
1 "gender":"Female","occupation":"student","handed":"Left"
2 "gender":"Female","occupation":"Reporter","handed":"Right"
df <- structure(list(ID = 1:2, Response = c("\"gender\":\"Female\",\"occupation\":\"student\",\"handed\":\"Left\"", 
"\"gender\":\"Female\",\"occupation\":\"Reporter\",\"handed\":\"Right\""
)), class = "data.frame", row.names = c(NA, -2L))

and I would like it to look like this

ID gender occupation handed
1 "Female" "student" "Left"
2 "Female" "Reporter" "Right"

I've managed to separate the question and response into one column with this

df<- df%>%
  mutate(NUM = row_number()) #row number is used as a key for merge

dfdelimted<- 
  str_split_fixed(df$response, ',"', 50)

dfdelimted<-data.frame(dfdelimted)
dfdelimted<- dfdelimted%>%
  mutate(NUM = row_number()) #row number is used as a key for merge

df <-  merge(df,dfdelimted, by="NUM") 

Which outputs this:

ID v1 v2 v3
1 "gender":"Female" "occupation":"student" "handed":"Left"
2 "gender":"Female" "occupation":"Reporter" "handed":"Right"

I can't seem to figure out how to word this for a google search so any help would be great!


Solution

  • Here's how you can do it with separate_longer_delim(). Basically what we're doing is:

    1. splitting the Response column on commas, turning each element into it's own row
    2. splitting the Reponse column on a ':', into name and the value columns
    3. turning it back into wide format, with the column names coming from the name column, and the values from the value column.
    library("tidyverse")
    
    df |> 
      separate_longer_delim(Response, delim = ",") |>
      separate_wider_regex(Response, patterns = c("\"", "name" = ".*?", "\":\"","value" = ".*?", "\"")) |>
      pivot_wider()
    

    Output:

    # A tibble: 2 × 4
         ID gender occupation handed
      <int> <chr>  <chr>      <chr> 
    1     1 Female student    Left  
    2     2 Female Reporter   Right 
    

    Maybe there's a way of creating the columns directly. Of course, you could extract them with regex, but I didn't want to presume that the values in the Response column will never change, but if are constant, then that would be the more straightforward approach.