rstringdataframetextsplit

How to extract strings after specific symbols in one column and separate to multiple rows?


I have data that contains the nearest gene sets, including their genomic region and strand, in one column. I want to make a new column for the single gene extracted from that column and separate them into multiple rows. Gene names are located after |+| or |-|.

Original data

original <- structure(list(id = c(1, 2), distance = c(0, 167155), 
    nearestTargetGeneAnchorB = c("chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L", 
        "chr1:248323630-248324568|-|OR2M7,chr1:248294616-248295578|-|OR2T12")), 
    class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
        -2L))

How can I transform the table to this?

fixed <- structure(list(id = c(1, 1, 1, 2, 2), distance = c(0, 
    0, 0, 167155, 167155), nearestTargetGeneAnchorB = c("chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L", 
    "chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L", 
    "chr1:248838178-248849517|+|ZNF672,chr1:248826377-248826443|+|MIR3124,chr1:248810452-248825955|-|SH3BP5L", 
    "chr1:248323630-248324568|-|OR2M7,chr1:248294616-248295578|-|OR2T12", 
    "chr1:248323630-248324568|-|OR2M7,chr1:248294616-248295578|-|OR2T13"), 
    ...4 = c("ZNF672", "MIR3124", "SH3BP5L", "OR2M7", "OR2T13")), 
    class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
        -5L))

Solution

  • library(tidyr)
    
    original %>% 
      separate_rows(nearestTargetGeneAnchorB, sep = ",") %>% 
      mutate(variable = sub(".*\\|", "", nearestTargetGeneAnchorB)) %>% 
      select(-nearestTargetGeneAnchorB)
      # A tibble: 5 × 3
         id distance variable
      <dbl>    <dbl> <chr>   
    1     1        0 ZNF672  
    2     1        0 MIR3124 
    3     1        0 SH3BP5L 
    4     2   167155 OR2M7   
    5     2   167155 OR2T12