rbioinformaticsrna-seq

Gene names from vmatchPattern (Biostrings)


I try to get the gene names out of a binding analysis of the 5'UTR. Therefore I have this little code. Until the vmatchPattern everything works fine. At least I hope so.

library(biomaRt)
library(GenomicFeatures)
library(XVector)
library(Biostrings)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(BSgenome.Mmusculus.UCSC.mm10)

fUTR <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene)

Mmusculus <- BSgenome.Mmusculus.UCSC.mm10
seqlevelsStyle(Mmusculus) <- 'ensembl'
seqlevelsStyle(fUTR) <- 'ensembl'

Seq <- getSeq(Mmusculus, fUTR)
Pbind <- RNAString('UGUGUGAAHAA')

Match <- vmatchPattern(Pbind, unlist2(Seq), max.mismatch = 0, min.mismatch = 0, with.indels = F, fixed = T, algorithm = 'auto')

Afterwards however I want to get the gene names to create a list in the end and use this in Python for further analysis of a RNAseq experiment. There comes a problem, I think I found so far three different ways on how to potentially do this. However none of them are working for me.

##How to get gene names from the match Pattern
#1
matches <- unlist(Match, recursive = T, use.names = T)
m <- as.matrix(matches)
subseq(genes[rownames(m),], start = m[rownames(m),1], width = 20)

#2
transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c('tx_id', 'tx_name', 'gene_id'))

#3
count_index <- countIndex(Match)
wh <- which(count_index > 0)
result_list = list()
for(i in 1: length(wh))  
{
  result_list[[i]] = Views(subject[[wh[i]]], mindex[[wh[i]]])
  
}
names(result_listF) = nm[wh]

I am happy to hear some suggestions and get some help or solution for this problem. I am no Bioinformation by training, so this took me already quite a while to figure this out.


Solution

  • So I found an answer, I hope this helps someone, and there is no mistake somewhere.

    library(BSgenome.Mmusculus.UCSC.mm10)
    library(TxDb.Mmusculus.UCSC.mm10.knownGene)
    library(org.Mm.eg.db)
    
    ##get all 5’ UTR sequences
    fUTR <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene)
    utr_ul <- unlist(fUTR, use.names = F)
    
    mcols(utr_ul)$tx_id <- rep(as.integer(names(fUTR)), lengths(fUTR))
    utr_ul
    
    tx2gene <- mcols(transcripts(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c('tx_id', 'tx_name', 'gene_id')))
    tx2gene$gene_id <- as.character(tx2gene$gene_id)
    
    m <- match(mcols(utr_ul)$tx_id, tx2gene$tx_id)
    mcols(utr_ul) <- cbind(mcols(utr_ul), tx2gene[m, -1L, drop = F])
    
    utr5_by_gene <- split(utr_ul, mcols(utr_ul)$gene_id)
    
    seqs <- getSeq(Mmusculus, utr5_by_gene)
    
    
    ##search with motif UGUGUGAAHAA
    motif <- DNAString('TGTGTGAAHAA')
    x <- vmatchPattern(motif, unlist(seqs), fixed = F)
    matches <- unlist(x, recursive = T, use.names = T)
    
    
    ##list all genes with matches
    hits <- mapIds(org.Mm.eg.db, keys = unique(names(matches)), keytype = 'ENTREZID',
                   column = 'SYMBOL', multiVals = 'first')