I have a dataframe that was read in from a tab-separated file, one column of which was itself semicolon-separated. This column contains most of my actual variables of interest, but it is not consistently ordered: some rows contain more information than others, and some rows have duplicate values. The variables of interest do, however, contain an identifier as part of their string, e.g. "gene eno".
For each row, I would like to identify and paste together all values that match a given identifier, as below:
Current dataframe:
Column A | V9_01 | V9_02 |
---|---|---|
CDS 1 | Index123 | gene "pla" |
CDS 2 | gene "dah" | |
CDS 3 | gene "blah" | Location:456 |
CDS 4 | gene "do" | gene "rah" |
CDS 5 | Index127 | Location893 |
Desired dataframe:
Column A | V9_01 | V9_02 | Gene_Name |
---|---|---|---|
CDS 1 | Index123 | gene "pla" | gene "pla" |
CDS 2 | gene "dah" | | gene "dah" |
CDS 3 | gene "blah" | Location:456 | gene "blah" |
CDS 4 | gene "do" | gene "rah" | gene "do", gene "rah" |
CDS 5 | Index127 | Location893 | NA |
I have made the current dataframe using the following code to read in the original file:
library(dplyr)           # provides %>%
library(splitstackshape) # provides cSplit()
DP_GTF <- read.delim("E:/Genome_Files/GTF/DolosPig51524.gtf", sep = "\t", comment.char = "#", header = FALSE) %>%
  subset(V3 == "CDS") %>%
  #select(c("V9")) %>%
  cSplit("V9", ";")
I'm not sure how to get my desired dataframe but assume I need to run grep over part of the dataframe?
An approach with base R using grep, searching for "gene":
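transform(df, Gene_Name = apply(df[, -1], 1, \(x) {
  # keep every value in the row that contains "gene", comma-separated
  res <- toString(grep("gene", x, value = TRUE))
  # rows without a match give "", which is turned into NA
  replace(res, res == "", NA)
}), check.names = FALSE)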
Column A V9_01 V9_02 Gene_Name
1 CDS 1 Index123 gene pla gene pla
2 CDS 2 gene dah gene dah
3 CDS 3 gene blah Location:456 gene blah
4 CDS 4 gene do gene rah gene do, gene rah
5 CDS 5 Index127 Location893 <NA>
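The snippet above assumes a `df` that is not defined in the post. As a minimal sketch, the following rebuilds the question's example table by hand (the column types are an assumption) and wraps the same grep idea in a hypothetical helper, `collect_matches()`, which is not part of any package. Two optional refinements are included as assumptions about the real data: `unique()` drops the duplicate values mentioned in the question, and the pattern is anchored so that a value must start with the identifier followed by whitespace, which would skip fields such as `gene_id` if the GTF contains them.

```r
# Example data rebuilt from the question's table (assumed layout and types)
df <- data.frame(
  `Column A` = paste("CDS", 1:5),
  V9_01 = c("Index123", 'gene "dah"', 'gene "blah"', 'gene "do"', "Index127"),
  V9_02 = c('gene "pla"', NA, "Location:456", 'gene "rah"', "Location893"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique values in one row that start with `identifier`
collect_matches <- function(x, identifier) {
  pattern <- paste0("^\\s*", identifier, "\\s")
  hits <- unique(grep(pattern, x, value = TRUE))
  # no match in this row: return NA instead of an empty string
  if (length(hits) == 0) NA_character_ else toString(hits)
}

# Apply row by row to every column except the first
df$Gene_Name <- apply(df[, -1], 1, collect_matches, identifier = "gene")
print(df)
```

If the values of interest can appear without a trailing space after the identifier, the pattern would need to be loosened accordingly.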
With dplyr using c_across with rowwise:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Gene_Name = toString(grep("gene", c_across(V9_01:V9_02), value = TRUE)),
         Gene_Name = replace(Gene_Name, Gene_Name == "", NA)) %>%
  ungroup()
# A tibble: 5 × 4
`Column A` V9_01 V9_02 Gene_Name
<chr> <chr> <chr> <chr>
1 CDS 1 "Index123" gene pla gene pla
2 CDS 2 "" gene dah gene dah
3 CDS 3 "gene blah" Location:456 gene blah
4 CDS 4 "gene do" gene rah gene do, gene rah
5 CDS 5 "Index127" Location893 NA
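Since the question asks about matching "a given identifier", the same row-wise idea can be repeated for more than one identifier. The following is only a sketch in base R: the `identifiers` mapping, the `Location` output column, and the `^V9_` column-name pattern are illustrative assumptions, and the example data is rebuilt by hand from the question's table.

```r
# Example data rebuilt from the question's table (assumed layout and types)
df <- data.frame(
  `Column A` = paste("CDS", 1:5),
  V9_01 = c("Index123", 'gene "dah"', 'gene "blah"', 'gene "do"', "Index127"),
  V9_02 = c('gene "pla"', NA, "Location:456", 'gene "rah"', "Location893"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical mapping: new column name -> identifier to search for
identifiers <- c(Gene_Name = "gene", Location = "Location")

# Columns produced by splitting V9; fixed before the loop so that
# newly added result columns are never searched themselves
value_cols <- grep("^V9_", names(df), value = TRUE)

for (new_col in names(identifiers)) {
  pattern <- paste0("^", identifiers[[new_col]])
  df[[new_col]] <- apply(df[value_cols], 1, function(x) {
    hits <- unique(grep(pattern, x, value = TRUE))
    # no match in this row: return NA instead of an empty string
    if (length(hits) == 0) NA_character_ else toString(hits)
  })
}
print(df)
```

Each identifier ends up in its own column, with NA where a row has no matching value.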