I have a messy table I read into a data frame similar to the following simplified one, which I want to parse.
mydf <- data.frame(group=LETTERS[1:5], code=101:105, ids=c('g1:id1,id2,id3\ng2:id4,id5',
'id6,id7,id8,id9',
'g1:id10,id11\ng3:id12',
'g2:id13,id14',
'id15'))
It looks like:
> mydf
group code ids
1 A 101 g1:id1,id2,id3\ng2:id4,id5
2 B 102 id6,id7,id8,id9
3 C 103 g1:id10,id11\ng3:id12
4 D 104 g2:id13,id14
5 E 105 id15
I need to separate the collapsed ids
column into multiple rows, to obtain a final data frame into long format.
For this, I would usually apply separate_rows
from tidyr
, but this case has an additional problem: some of the collapsed ids
are grouped into subgroups.
I guess I should first separate those subgroups into different ids columns, and then apply separate_rows
or separate_longer_delim
on those columns... I do not know how to go about this.
The final data frame I expect would be like this, in long format, with the subgroup
information in an additional column:
> mydf_new
group code id subgroup
1 A 101 id1 g1
2 A 101 id2 g1
3 A 101 id3 g1
4 A 101 id4 g2
5 A 101 id5 g2
6 B 102 id6 <NA>
7 B 102 id7 <NA>
8 B 102 id8 <NA>
9 B 102 id9 <NA>
10 C 103 id10 g1
11 C 103 id11 g1
12 C 103 id12 g3
13 D 104 id13 g2
14 D 104 id14 g2
15 E 105 id15 <NA>
One way is to separate wider / longer across a few steps:
library(tidyr)
library(dplyr)
mydf |>
separate_longer_delim(ids, "\n") |>
separate_wider_delim(ids, ":", names = c("subgroup", "id"), too_few = "align_end") |>
separate_longer_delim(id, ",") |>
relocate(subgroup, .after = last_col())
# A tibble: 15 × 4
group code id subgroup
<chr> <int> <chr> <chr>
1 A 101 id1 g1
2 A 101 id2 g1
3 A 101 id3 g1
4 A 101 id4 g2
5 A 101 id5 g2
6 B 102 id6 NA
7 B 102 id7 NA
8 B 102 id8 NA
9 B 102 id9 NA
10 C 103 id10 g1
11 C 103 id11 g1
12 C 103 id12 g3
13 D 104 id13 g2
14 D 104 id14 g2
15 E 105 id15 NA