rtidyr

Transform dataframe to long format based on text


I have a dataset from a survey. The setup of the survey was as follows: Q1 was a multiple choice question, and in Q2 answers are given for each of the selected choices in Q1 (same for Q3 up to Q9). After exporting and uploading to R, the dataframe is as follows:

df <- data.frame(
  ID = 1:3,
  Q1 = c("Purple, Red, Blue", "Red, Blue, Yellow", "Red"),
  "Q2: Is it your favorite color? - Purple" = c("Average", NA, NA),
  "Q2: Is it your favorite color? - Red" = c("No", "No", "Average"),
  "Q2: Is it your favorite color? - Blue" = c("Yes", "No", NA),
  "Q2: Is it your favorite color? - Yellow" = c(NA, "Average", NA)
)

What I'm looking to do is to transform so that the dataframe is a long format like below

ID Q1 Q2
1 Purple Average
1 Red No
1 Blue Yes
2 Red No
2 Blue No
2 Yellow Average
3 Red Average

I was thinking to first separate the answers to Q1 into rows, and then pivot_longer the Q2 columns based on the selected option in Q2.

df1 <- df %>% 
  separate_rows(Q1, sep = ",")

df2 <- df1 %>% 
  pivot_longer(cols = matches(Q1),
               names_to = Q2,
               values_to = Q2_answer)

But this doesn't work (error in pivot_longer: selections can't have missing values).

What am I doing wrong? Or is there another way to do this?


Solution

  • Try this:

    library(dplyr)
    library(tidyr)
    pivot_longer(df, cols = -c(ID, Q1), values_to = "Q2") |>
      separate_longer_delim(Q1, delim = ",") |>
      mutate(Q1 = trimws(Q1)) |>
      filter(sub(".* ", "", name) == Q1) |>
      select(-name)
    # # A tibble: 7 × 3
    #      ID Q1     Q2     
    #   <int> <chr>  <chr>  
    # 1     1 Purple Average
    # 2     1 Red    No     
    # 3     1 Blue   Yes    
    # 4     2 Red    No     
    # 5     2 Blue   No     
    # 6     2 Yellow Average
    # 7     3 Red    Average
    

    Data, adding check.names=:

    df <- data.frame(
      ID = 1:3,
      Q1 = c("Purple, Red, Blue", "Red, Blue, Yellow", "Red"),
      "Q2: Is it your favorite color? - Purple" = c("Average", NA, NA),
      "Q2: Is it your favorite color? - Red" = c("No", "No", "Average"),
      "Q2: Is it your favorite color? - Blue" = c("Yes", "No", NA),
      "Q2: Is it your favorite color? - Yellow" = c(NA, "Average", NA),
      check.names = FALSE)