rstringtidyr

R separate_longer_delim to split strings with quotation marks


I am trying to use the separate_longer_delim() function from the tidyr package to split a string into separate rows.

This is how the dataset looks like:

Subject Outcome
001 ["Rubbing of nose or ears","Itching of the mouth, tongue, throat","Mild pain","Eye itching"]
002 ["Itchy Tongue", Mild pain","Eye irritation"]

And I would like the dataset to be:

Subject Outcome
001 Rubbing of nose or ears
001 Itching of the mouth, tongue, throat
001 Mild pain
002 Itchy Tongue
002 Mild pain
002 Eye irritation

Code for data:

subject <- c('001', '002')
outcome <- c('["Rubbing of nose or ears","Itching of the mouth, tongue, throat","Mild pain","Eye itching"]', '["Itchy Tongue", Mild pain","Eye irritation"]')

dta <- as.data.frame(cbind(subject, outcome))

Solution

  • library(dplyr)
    library(stringr)
    library(tidyr)
    
    dta |>
      mutate(outcome = str_replace_all(outcome, '\",', "&") |> str_remove_all("[^[\\w,& ]+]")) |>
      separate_longer_delim(outcome, "&")
    

    This replaces ", with a new delimiter & since end quote + a comma delimits the elements of outcome. We can then split on this new delimiter.

    Alternatively, you can split on ", and then remove everything that is not a a word character, comma, or space:

    dta |>
      separate_longer_delim(outcome, '\",') |>
      mutate(outcome = str_remove_all(outcome, "[^[\\w ,]+]"))
    

    Output

      subject                              outcome
    1     001              Rubbing of nose or ears
    2     001 Itching of the mouth, tongue, throat
    3     001                            Mild pain
    4     001                          Eye itching
    5     002                         Itchy Tongue
    6     002                            Mild pain
    7     002                       Eye irritation