pythonrbioinformaticsbiopythonbioconductor

'position-aware' aligning of sequences with letter annotations


We have 2 DNA sequences (strings):

>1
ATGCAT
135198
>2
ATCAT

Expected output: first, we need to align these 2 strings, then get relevant annotation by index:

ATGCAT
AT-CAT
13-198

First part can be done using Biostrings package:

library(Biostrings)

p <- DNAString("ATCAT")
s <- DNAString("ATGCAT")
s_annot <- "135198"

x <- pairwiseAlignment(pattern = p, subject = s)

aligned(x)
# A DNAStringSet instance of length 1
#     width seq
# [1]     6 AT-CAT
as.character(x)
# [1] "AT-CAT"
as.matrix(x)
#     [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] "A"  "T"  "-"  "C"  "A"  "T" 

Current workaround for the 2nd part:

annot <- unlist(strsplit(s_annot, ""))
annot[ which(c(as.matrix(x)) == "-") ] <- "-"
# [1] "1" "3" "-" "1" "9" "8"

Works fine, but I was wondering if there is a Biostrings way of doing it (or any other packages), maybe by keeping the annotation in the metadata slot, then after aligning we get the matching annotation for matched bases in metadata, something like below:

getSlots("DNAString")
#      shared              offset              length     elementMetadata            metadata 
# "SharedRaw"           "integer"           "integer" "DataTable_OR_NULL"              "list" 

# just an idea, non-working code
s@metadata <- unlist(strsplit(s_annot , ""))
x <- pairwiseAlignment(pattern = p, subject = s)
metadata(x)
# [[1]]
# [1] "1" "3" "-" "1" "9" "8"

Note:


Solution

  • Answer from - https://www.biostars.org/p/415896/#9564298


    This just popped up again in my inbox, so thought I'd come back with the solution I came up with.

    I believe I arrived at the eventual solution based on an idea from Devon Ryan.

    It's embedded in to a much more convoluted codebase, but the general idea was this:

    1. Create a pairwise alignment of the sequences as normal.
    2. Iterate through the alignment and create a sort of CIGAR string which captures the various sequence 'movements'.
    3. Apply/map this same set of movements to the list of annotations corresponding to each position, thus repeating the set of operations

    Code here:

    https://github.com/jrjhealey/ISIS/blob/master/ImmunoRender.py#L35-L103