I am trying to use the separate_longer_delim() function from the tidyr package to split a string into separate rows.
This is how the dataset looks like:
Subject | Outcome |
---|---|
001 | ["Rubbing of nose or ears","Itching of the mouth, tongue, throat","Mild pain","Eye itching"] |
002 | ["Itchy Tongue", Mild pain","Eye irritation"] |
And I would like the dataset to be:
Subject | Outcome |
---|---|
001 | Rubbing of nose or ears |
001 | Itching of the mouth, tongue, throat |
001 | Mild pain |
002 | Itchy Tongue |
002 | Mild pain |
002 | Eye irritation |
Code for data:
subject <- c('001', '002')
outcome <- c('["Rubbing of nose or ears","Itching of the mouth, tongue, throat","Mild pain","Eye itching"]', '["Itchy Tongue", Mild pain","Eye irritation"]')
dta <- as.data.frame(cbind(subject, outcome))
library(dplyr)
library(stringr)
library(tidyr)
dta |>
mutate(outcome = str_replace_all(outcome, '\",', "&") |> str_remove_all("[^[\\w,& ]+]")) |>
separate_longer_delim(outcome, "&")
This replaces ",
with a new delimiter &
since end quote + a comma delimits the elements of outcome
. We can then split on this new delimiter.
Alternatively, you can split on ",
and then remove everything that is not a a word character, comma, or space:
dta |>
separate_longer_delim(outcome, '\",') |>
mutate(outcome = str_remove_all(outcome, "[^[\\w ,]+]"))
Output
subject outcome
1 001 Rubbing of nose or ears
2 001 Itching of the mouth, tongue, throat
3 001 Mild pain
4 001 Eye itching
5 002 Itchy Tongue
6 002 Mild pain
7 002 Eye irritation