I have a reference and query sequences:
ref_seq <- "ATTT"
df <- data.frame(V1=c("AATT", "TTTT", "GGTT"))
I would like to return the mismatch positions in the sequence for each query compared to reference:
seqdiff <- function(seq1, seq2) {
seq <- strsplit(c(seq1, seq2), split= '')
mismatches <- which(seq[[1]] != seq[[2]])
return(mismatches)
}
apply(X=df, MARGIN=2, function(x) seqdiff(x, ref_seq))
# V1
# [1,] 1
# [2,] 2
Expected Output:
# V1
# [1,] 2
# [2,] 1
# [3,] 1 2
Given these are likely nucleotide sequences, you might consider the adist
function. This can be used in other cases to determine the minimal weighted number of insertions, deletions, and substitutions needed to transform one string into another. This allows for counts to be computed in transformation, as well as the transformation sequence in the "trafos" attribute (M = match, I = insertion, D = deletion, S = Substitution).
df$trafos <- attr(adist(df$V1, ref_seq, counts = TRUE), "trafos")
df$substitution <- gregexpr("S", df$trafos)
df
V1 trafos substitution
1 AATT MSMM 2
2 TTTT SMMM 1
3 GGTT SSMM 1, 2
As noted in the comments by @AndreWildberg, different contexts may require different strategies. Other packages such as through Bioconductor or bedtoolsr
may be more appropriate depending on need.
If using adist
, you can also specify the "cost" argument, to apply costs for insertions, deletions, or substitutions when computing string distance.
In the case of ref_seq <- "GAAA"
for example, you could apply costs for insertions and deletions, which would give a different (and probably more desirable) answer in this case.
df$trafos <- attr(adist(df$V1,
ref_seq,
counts = TRUE,
costs = list(ins = 1, del = 1, sub = 0)),
"trafos")
df$substitution <- gregexpr("S", df$trafos)
df
V1 trafos substitution
1 AATT SMSS 1, 3, 4
2 TTTT SSSS 1, 2, 3, 4
3 GGTT MSSS 2, 3, 4
In case there is interest in Bioconductor (Biostrings) to look at alignment and mismatches between two sequences, I'm adding a quick example.
library(Biostrings)
library(pwalign)
alg <- pairwiseAlignment("AATT", "GAAA")
mismatchTable(alg)
PatternId PatternStart PatternEnd PatternSubstring PatternQuality SubjectStart SubjectEnd SubjectSubstring SubjectQuality
1 1 1 1 A 7 1 1 G 7
2 1 3 3 T 7 3 3 A 7
3 1 4 4 T 7 4 4 A 7