I have data that contains the nearest gene sets, including their genomic region and strand, in one column. I want to make a new column for the single gene extracted from that column and separate them into multiple rows. Gene names are located after |+| or |-|.
Original data
original <- structure(list(id = c(1, 2), distance = c(0, 167155),
nearestTargetGeneAnchorB = c("chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L",
"chr1:248323630-248324568|-|OR2M7,chr1:248294616-248295578|-|OR2T12")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L))
How can I transform the table to this?
fixed <- structure(list(id = c(1, 1, 1, 2, 2), distance = c(0,
0, 0, 167155, 167155), nearestTargetGeneAnchorB = c("chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L",
"chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L",
"chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L",
"chr1:248323630-248324568|-|OR2M7,chr1:248294616-248295578|-|OR2T12",
"chr1:248323630-248324568|-|OR2M7,chr1:248294616-248295578|-|OR2T13"),
...4 = c("ZNF672", "MIR3124", "SH3BP5L", "OR2M7", "OR2T13")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
library(tidyr)
original %>%
separate_rows(nearestTargetGeneAnchorB, sep = ",") %>%
mutate(variable = sub(".*\\|", "", nearestTargetGeneAnchorB)) %>%
select(-nearestTargetGeneAnchorB)
# A tibble: 5 × 3
id distance variable
<dbl> <dbl> <chr>
1 1 0 ZNF672
2 1 0 MIR3124
3 1 0 SH3BP5L
4 2 167155 OR2M7
5 2 167155 OR2T12