r dataframe grepl

Creating a new column by pasting all columns matching part of a string in R


I have a dataframe that was read in from a tab-separated file, one column of which was itself semicolon-separated. This column contains most of my variables of interest, but it is not sorted: some rows contain more information than others, and some rows have duplicate values. The variables of interest do, however, contain an identifier as part of their string, e.g. "gene eno".

For each row I would like to identify and paste together all values that match a given identifier, as below:

Current dataframe:

Column A   V9_01         V9_02
CDS 1      Index123      gene "pla"
CDS 2                    gene "dah"
CDS 3      gene "blah"   Location:456
CDS 4      gene "do"     gene "rah"
CDS 5      Index127      Location893

Desired dataframe:

Column A   V9_01         V9_02          Gene_Name
CDS 1      Index123      gene "pla"     gene "pla"
CDS 2                    gene "dah"     gene "dah"
CDS 3      gene "blah"   Location:456   gene "blah"
CDS 4      gene "do"     gene "rah"     gene "do", gene "rah"
CDS 5      Index127      Location893    NA

I have made the current dataframe using the following code to read in the original file:

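library(dplyr)            # provides %>%
library(splitstackshape)  # provides cSplit()

DP_GTF <- read.delim("E:/Genome_Files/GTF/DolosPig51524.gtf", sep = "\t", comment.char = "#", header = FALSE) %>%
  subset(V3 == "CDS") %>%   # keep only the CDS features
  #select(c("V9")) %>%
  cSplit("V9", ";")         # split the semicolon-separated column into V9_01, V9_02, ...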

I'm not sure how to get my desired dataframe but assume I need to run grep over part of the dataframe?
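
To make the example reproducible without the original GTF file, the current dataframe can be rebuilt by hand. This is only a sketch: the name `df` and the use of an empty string for the missing cell are assumptions, not something taken from the real file.

```r
# Hypothetical stand-in for the data after cSplit(); the name `df` and the
# empty string in row 2 are assumptions made for illustration.
df <- data.frame(
  `Column A` = c("CDS 1", "CDS 2", "CDS 3", "CDS 4", "CDS 5"),
  V9_01 = c("Index123", "", 'gene "blah"', 'gene "do"', "Index127"),
  V9_02 = c('gene "pla"', 'gene "dah"', "Location:456", 'gene "rah"', "Location893"),
  check.names = FALSE,       # keep the space in "Column A"
  stringsAsFactors = FALSE   # keep the values as plain character strings
)
```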


Solution

  • An approach with base R using grep, searching for gene

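    # For each row (excluding the first column), keep the values containing
    # "gene", join them with ", ", and turn an empty result into NA
    transform(df, Gene_Name = apply(df[,-1], 1, \(x){
                                res <- toString(grep("gene", x, value=TRUE))
                                replace(res, res == "", NA)}), check.names=FALSE)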
      Column A     V9_01        V9_02         Gene_Name
    1    CDS 1  Index123     gene pla          gene pla
    2    CDS 2               gene dah          gene dah
    3    CDS 3 gene blah Location:456         gene blah
    4    CDS 4   gene do     gene rah gene do, gene rah
    5    CDS 5  Index127  Location893              <NA>
    

  • With dplyr, using c_across with rowwise

    library(dplyr)
    
    df %>% 
      rowwise() %>% 
      # c_across() gathers the row's V9 columns into one vector for grep()
      mutate(Gene_Name = toString(grep("gene", c_across(V9_01:V9_02), value=TRUE)),
             Gene_Name = replace(Gene_Name, Gene_Name == "", NA)) %>%
      ungroup()
    # A tibble: 5 × 4
      `Column A` V9_01       V9_02        Gene_Name        
      <chr>      <chr>       <chr>        <chr>            
    1 CDS 1      "Index123"  gene pla     gene pla         
    2 CDS 2      ""          gene dah     gene dah         
    3 CDS 3      "gene blah" Location:456 gene blah        
    4 CDS 4      "gene do"   gene rah     gene do, gene rah
    5 CDS 5      "Index127"  Location893  NA
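
The real file will likely produce more than two split columns, and the question mentions duplicate values, so a slightly more general variant of the base R approach may be useful. The sketch below rests on assumptions: every split column is taken to start with `V9_`, the extra `V9_03` column and its `gene_id` value are invented purely for illustration, and `collect_genes` is a hypothetical helper name. The pattern is anchored as `^gene\\s` so that only values beginning with the word "gene" followed by whitespace are kept.

```r
# Hypothetical example data: the question's table plus an invented V9_03
# column holding a duplicate value and a look-alike "gene_id" value.
df <- data.frame(
  `Column A` = c("CDS 1", "CDS 2", "CDS 3", "CDS 4", "CDS 5"),
  V9_01 = c("Index123", "", 'gene "blah"', 'gene "do"', "Index127"),
  V9_02 = c('gene "pla"', 'gene "dah"', "Location:456", 'gene "rah"', "Location893"),
  V9_03 = c("", "", "", 'gene "do"', 'gene_id "x1"'),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumption: all columns created by the split start with "V9_".
attr_cols <- grep("^V9_", names(df), value = TRUE)

# Hypothetical helper: return the unique values in a row that start with
# "gene" followed by whitespace, joined by ", "; NA when nothing matches.
collect_genes <- function(x) {
  hits <- unique(grep("^gene\\s", x, value = TRUE))
  if (length(hits) == 0) NA_character_ else toString(hits)
}

# Apply the helper row by row over the selected columns only.
df$Gene_Name <- apply(df[attr_cols], 1, collect_genes)
```

Selecting the columns by name pattern avoids hard-coding `V9_01:V9_02`, and `unique()` drops repeated values within a row before they are pasted together.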