We have 2 DNA sequences (strings):
>1
ATGCAT
135198
>2
ATCAT
Expected output: first, we need to align these 2 strings, then get relevant annotation by index:
ATGCAT
AT-CAT
13-198
First part can be done using Biostrings package:
library(Biostrings)
p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"
x <- pairwiseAlignment(pattern = p, subject = s)
aligned(x)
# A DNAStringSet instance of length 1
# width seq
# [1] 6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A" "T" "-" "C" "A" "T"
Current workaround for the 2nd part:
annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"
Works fine, but I was wondering if there is a Biostrings way of doing it (or any other packages), maybe by keeping the annotation in the metadata slot, then after aligning we get the matching annotation for matched bases in metadata, something like below:
getSlots("DNAString")
# shared offset length elementMetadata metadata
# "SharedRaw" "integer" "integer" "DataTable_OR_NULL" "list"
# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"
Note:
Answer from - https://www.biostars.org/p/415896/#9564298
This just popped up again in my inbox, so thought I'd come back with the solution I came up with.
I believe I arrived at the eventual solution based on an idea from Devon Ryan.
It's embedded in to a much more convoluted codebase, but the general idea was this:
Code here:
https://github.com/jrjhealey/ISIS/blob/master/ImmunoRender.py#L35-L103