r, dna-sequence, sequence-alignment, string-algorithm

How to do multiple sequence alignment of text strings (utf8) in R


Given five strings:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

I would like to do multiple sequence alignment so that I get the following result:

abcd
 bcde
  cdef
a    f
  cd  ghi

Using the msa() function from the msa package I tried:

msa(seq, type = "protein", order = "input", method = "Muscle")

and got the following result:

    aln     names
 [1] ABCD--- Seq1
 [2] -BCDE-- Seq2
 [3] --CD-EF Seq3
 [4] -----AF Seq4
 [5] --CDGHI Seq5
 Con --CD-?? Consensus   

I would like to use this function for sequences that can contain any Unicode characters, but even in this simple example the function gives a warning: invalid letters found. Any ideas?


Solution

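  • Here's a solution in base R that outputs a table with one column per distinct character. It relies on each character occurring at most once per string and on the strings sharing a common character order, as they do in the example: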

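    seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

    # Every distinct character, in order of first appearance; one column each
    all_chars <- unique(unlist(strsplit(seq, "")))

    # Count each character per string (one row per string), then show the
    # character where its count is 1 and a blank otherwise
    counts <- do.call(rbind, lapply(strsplit(seq, ""),
                      function(x) table(factor(x, all_chars))))
    tab <- t(apply(counts, 1, function(x) ifelse(x == 1, all_chars, " ")))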
    

    We can print the output without quotes to see it more clearly:

    print(tab, quote = FALSE)
    #>      a b c d e f g h i
    #> [1,] a b c d          
    #> [2,]   b c d e        
    #> [3,]     c d e f      
    #> [4,] a         f      
    #> [5,]     c d     g h i
    

    Created on 2022-05-25 by the reprex package (v2.0.1)
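
The table can also be collapsed into padded strings like the ones asked for in the question. What follows is only a minimal sketch, assuming a UTF-8 session and the same restrictions as the table approach (the strings share a common character order, and a repeated character within a string is shown only once). `align_by_first_appearance` is a hypothetical helper name, not part of any package, and this is a simple column layout rather than a true alignment algorithm of the kind `msa()` runs.

```r
# Hypothetical helper: lay each string out on a shared set of columns,
# one column per distinct character, ordered by first appearance.
align_by_first_appearance <- function(strings) {
  # Split every string into single characters
  chars <- strsplit(strings, "")
  # Column order (assumption: all strings follow this same order)
  all_chars <- unique(unlist(chars))
  # Keep a character where the string contains it, a space elsewhere,
  # then paste the columns into one padded string
  padded <- vapply(
    chars,
    function(x) paste(ifelse(all_chars %in% x, all_chars, " "), collapse = ""),
    character(1)
  )
  # Drop trailing padding
  sub(" +$", "", padded)
}

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
aligned <- align_by_first_appearance(seq)
writeLines(aligned)
#> abcd
#>  bcde
#>   cdef
#> a    f
#>   cd  ghi

# Non-ASCII input goes through the same steps, since strsplit() with ""
# splits per character rather than per byte
greek <- align_by_first_appearance(c("αβγ", "βγδ", "αδ"))
writeLines(greek)
#> αβγ
#>  βγδ
#> α  δ
```

Because nothing here depends on a fixed alphabet, the "invalid letters" warning does not arise, but strings whose characters appear in conflicting orders would need a real alignment method instead.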