Given five strings:
seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
I would like to do multiple sequence alignment so that I get the following result:
abcd
 bcde
  cdef
a    f
  cd  ghi
Using the msa() function from the msa package, I tried:
library(msa)
msa(seq, type = "protein", order = "input", method = "Muscle")
and got the following result:
    aln     names
[1] ABCD--- Seq1
[2] -BCDE-- Seq2
[3] --CD-EF Seq3
[4] -----AF Seq4
[5] --CDGHI Seq5
Con --CD-?? Consensus
I would like to use this function for sequences that can contain any Unicode characters, but even in this small example the function gives a warning: invalid letters found. Any ideas?
Here's a solution in base R that outputs a table:
seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
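# One column per distinct character, in order of first appearance
all_chars <- unique(unlist(strsplit(seq, "")))
# Count characters per string; keep a character where its count is 1, else blank
tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""),
                                     function(x) table(factor(x, all_chars)))), 1,
               function(x) ifelse(x == 1, all_chars, " ")))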
We can print the output without quotes to see it more clearly:
print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d
#> [2,]   b c d e
#> [3,]     c d e f
#> [4,] a         f
#> [5,]     c d     g h i
Created on 2022-05-25 by the reprex package (v2.0.1)
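If you want padded strings like those in the question rather than a table, the same idea can be wrapped in a small function. This is only a sketch under some assumptions: `align_by_char` is a hypothetical helper name, not part of base R or msa, and it tests for presence with `%in%` instead of requiring a count of exactly 1, so a character repeated within a string is still shown (once). It is a presence table and not a true sequence alignment: columns follow the order of first appearance across all strings, and the order of characters within each string is ignored.

```r
# Hypothetical helper (not from base R or the msa package): one column per
# distinct character, in order of first appearance across all strings.
align_by_char <- function(x, blank = " ") {
  chars <- strsplit(x, "")              # split each string into characters
  all_chars <- unique(unlist(chars))    # defines the column order
  # One row per string: keep a character where it occurs, blank elsewhere.
  tab <- do.call(rbind, lapply(chars, function(ch) {
    ifelse(all_chars %in% ch, all_chars, blank)
  }))
  colnames(tab) <- all_chars
  tab
}

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")
tab <- align_by_char(seq)

# Collapse each row into a single padded string, one per input sequence.
aligned <- apply(tab, 1, paste, collapse = "")
writeLines(aligned)
```

The collapsed strings are padded with trailing blanks to a common width; `trimws(aligned, which = "right")` removes that padding if it is not wanted. Since `strsplit(x, "")` splits by character rather than by byte, this should also cope with non-ASCII input, assuming the strings are correctly encoded.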