apply a function with two arguments, rowwise on a data frame

I have a reference and query sequences:

ref_seq <- "ATTT"
df <- data.frame(V1=c("AATT", "TTTT", "GGTT"))

I would like to return the mismatch positions in the sequence for each query compared to reference:

seqdiff <- function(seq1, seq2) {
  seq <- strsplit(c(seq1, seq2), split= '')
  mismatches <- which(seq[[1]] != seq[[2]])
  return(mismatches)
}
    
apply(X=df, MARGIN=2, function(x) seqdiff(x, ref_seq))

#      V1
# [1,]  1
# [2,]  2

Expected Output:

#      V1
# [1,]  2
# [2,]  1
# [3,]  1 2

Solution

Given these are likely nucleotide sequences, you might consider the adist function. This can be used in other cases to determine the minimal weighted number of insertions, deletions, and substitutions needed to transform one string into another. This allows for counts to be computed in transformation, as well as the transformation sequence in the "trafos" attribute (M = match, I = insertion, D = deletion, S = Substitution).

df$trafos <- attr(adist(df$V1, ref_seq, counts = TRUE), "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   MSMM            2
2 TTTT   SMMM            1
3 GGTT   SSMM         1, 2

As noted in the comments by @AndreWildberg, different contexts may require different strategies. Other packages such as through Bioconductor or bedtoolsr may be more appropriate depending on need.

If using adist, you can also specify the "cost" argument, to apply costs for insertions, deletions, or substitutions when computing string distance.

In the case of ref_seq <- "GAAA" for example, you could apply costs for insertions and deletions, which would give a different (and probably more desirable) answer in this case.

df$trafos <- attr(adist(df$V1, 
                        ref_seq, 
                        counts = TRUE, 
                        costs = list(ins = 1, del = 1, sub = 0)), 
                  "trafos")
df$substitution <- gregexpr("S", df$trafos)
df

    V1 trafos substitution
1 AATT   SMSS      1, 3, 4
2 TTTT   SSSS   1, 2, 3, 4
3 GGTT   MSSS      2, 3, 4

In case there is interest in Bioconductor (Biostrings) to look at alignment and mismatches between two sequences, I'm adding a quick example.

library(Biostrings)
library(pwalign)

alg <- pairwiseAlignment("AATT", "GAAA")
mismatchTable(alg)

  PatternId PatternStart PatternEnd PatternSubstring PatternQuality SubjectStart SubjectEnd SubjectSubstring SubjectQuality
1         1            1          1                A              7            1          1                G              7
2         1            3          3                T              7            3          3                A              7
3         1            4          4                T              7            4          4                A              7