r, for-loop, optimization, nested-loops, needleman-wunsch

R - nested loop alternatives/optimization


I'm currently trying to implement an algorithm in R that loops through the rows and columns of a matrix and, for every cell, computes a value based on the values of previously computed cells.

Here is the code that does this; it is part of the Needleman-Wunsch algorithm:

    globalSequenceAlignment <- function(seq1, seq2, match, mismatch, gap) {
    
    # splitting the sequences in order to use them as rows and columns names
    seq1_split <- unlist(strsplit(toString(seq1), ""))
    seq2_split <- unlist(strsplit(toString(seq2), ""))
    
    len1 <- length(seq1_split)
    len2 <- length(seq2_split)
    
    # creating the alignment matrix
    alignment_matrix <- matrix(0, nrow = len2+1, ncol = len1+1)
    colnames(alignment_matrix) <- c("-", seq1_split)
    rownames(alignment_matrix) <- c("-", seq2_split)
    
    # filling first row and column of the alignment matrix
    for (i in 2:ncol(alignment_matrix)) {
      alignment_matrix[1,i] <- (alignment_matrix[1,i]+(i-1))*(gap)
    }
    
    for (j in 2:nrow(alignment_matrix)) {
      alignment_matrix[j,1] <- (alignment_matrix[j,1]+(j-1))*(gap)
    }
    
    for (i in 2:ncol(alignment_matrix)) {
      for (j in 2:nrow(alignment_matrix)) {
        
        horizontal_score <- alignment_matrix[j,i-1] + gap
        vertical_score <- alignment_matrix[j-1,i] + gap
        
        if (colnames(alignment_matrix)[i] == rownames(alignment_matrix)[j]) {
          diagonal_score <- alignment_matrix[j-1,i-1] + match
        } else {
          diagonal_score <- alignment_matrix[j-1,i-1] + mismatch
        }
        
        scores <- c(horizontal_score, vertical_score, diagonal_score)
        
        alignment_matrix[j,i] <- max(scores)
        
      }
    }
    
    
    return(alignment_matrix)
  
    }
    
    a <- 'GAATC'
    b <- 'CATACG'
    
    globalSequenceAlignment(a, b, 10,-5,-4)

With this code I get the result I want. The problem is that for matrices larger than 500x500 the nested loops become far too slow (running this code on a 500x500 matrix takes roughly 2 minutes).

I know that the *apply functions could improve this, but I couldn't manage to use them, because computing each cell requires that the previous cells have already been computed.

I was wondering whether there is a way to achieve the same result using *apply functions, or a way to vectorize this kind of code so that it runs faster in R.
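One way around the dependency described above is to notice that a cell only needs its upper, left and upper-left neighbours, so all cells on the same anti-diagonal are independent of each other and can be filled in one vectorized step. The following is a minimal sketch of that idea, not a tuned implementation; the function name `nw_fill_antidiagonal` is hypothetical, and it assumes both sequences are non-empty single strings.

```r
# Hypothetical helper: fills the Needleman-Wunsch score matrix one
# anti-diagonal at a time. Assumes seq1 and seq2 are non-empty strings.
nw_fill_antidiagonal <- function(seq1, seq2, match, mismatch, gap) {
  s1 <- strsplit(seq1, "")[[1]]  # labels the columns
  s2 <- strsplit(seq2, "")[[1]]  # labels the rows
  n1 <- length(s1)
  n2 <- length(s2)

  # First row and first column hold the accumulated gap penalties.
  m <- matrix(0, nrow = n2 + 1, ncol = n1 + 1)
  m[1, ] <- (0:n1) * gap
  m[, 1] <- (0:n2) * gap

  # Match/mismatch score for every (row, column) pair, computed once.
  sub <- ifelse(outer(s2, s1, "=="), match, mismatch)

  # Interior cell (rows, cols) lives at m[rows + 1, cols + 1]. Cells sharing
  # d = rows + cols form one anti-diagonal and only read earlier diagonals.
  for (d in 2:(n1 + n2)) {
    rows <- max(1, d - n1):min(n2, d - 1)
    cols <- d - rows

    vertical   <- m[cbind(rows,     cols + 1)] + gap
    horizontal <- m[cbind(rows + 1, cols)]     + gap
    diagonal   <- m[cbind(rows,     cols)]     + sub[cbind(rows, cols)]

    m[cbind(rows + 1, cols + 1)] <- pmax(vertical, horizontal, diagonal)
  }

  dimnames(m) <- list(c("-", s2), c("-", s1))
  m
}
```

Called as `nw_fill_antidiagonal(a, b, 10, -5, -4)` with the sequences from the question, this should produce the same matrix as `globalSequenceAlignment`. The number of R-level loop iterations drops from len1 * len2 to len1 + len2 - 1, although how much that helps in practice would need to be measured.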


Solution

  • In case anyone ever needs this: I wrote my own solution to this problem using the Rcpp package. The runtime for sequences of 500 characters dropped from about 3 minutes to about 0.3 s.

    Here is the code for the two nested loops shown in the question; I hope it will be useful to someone.

    library(Rcpp)
    
    # cppFunction() compiles the C++ source and exposes it as an R function
    cppFunction('IntegerMatrix rcpp_compute_matrices(IntegerMatrix Am, StringMatrix Dm,
                                                     StringVector seq1, StringVector seq2,
                                                     int gap, int miss, int match) {
    
        int nrow = Am.nrow(), ncol = Am.ncol();
    
        // row 0 and column 0 must already hold the gap penalties
        for (int i = 1; i < nrow; i++) {
          for (int j = 1; j < ncol; j++) {
            int vertical_score = Am(i-1, j) + gap;
            int horizontal_score = Am(i, j-1) + gap;
            int diagonal_score = 0;
            // seq1 labels the columns, seq2 labels the rows
            if (seq1[j-1] == seq2[i-1]) {
              diagonal_score = Am(i-1, j-1) + match;
            }
            else {
              diagonal_score = Am(i-1, j-1) + miss;
            }
    
            IntegerVector score = {vertical_score, horizontal_score, diagonal_score};
    
            int max_score = max(score);
    
            Am(i, j) = max_score;
    
          }
        }
        return Am;
    }')
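
The function above only fills the interior of the matrix, so the R side still has to build its inputs. A minimal sketch of that setup might look as follows. The wrapper name `align_with_rcpp` and its `compute` argument are hypothetical and not part of the answer. `Dm` is never read in the code shown, so a blank character matrix is passed as a placeholder, which is an assumption about its role. The matrix is created with integer storage to match `IntegerMatrix`.

```r
# Hypothetical wrapper: prepares the inputs and delegates the nested loops to
# `compute`, which defaults to the compiled rcpp_compute_matrices from above.
align_with_rcpp <- function(seq1, seq2, match, mismatch, gap,
                            compute = rcpp_compute_matrices) {
  s1 <- strsplit(seq1, "")[[1]]  # labels the columns
  s2 <- strsplit(seq2, "")[[1]]  # labels the rows
  n1 <- length(s1)
  n2 <- length(s2)
  gap <- as.integer(gap)

  # Integer score matrix with the first row and column pre-filled.
  Am <- matrix(0L, nrow = n2 + 1L, ncol = n1 + 1L)
  Am[1, ] <- (0:n1) * gap
  Am[, 1] <- (0:n2) * gap

  # Placeholder only: the shown C++ code does not use Dm (assumption).
  Dm <- matrix("", nrow = n2 + 1L, ncol = n1 + 1L)

  # Argument order follows the C++ signature: gap, miss, match.
  res <- compute(Am, Dm, s1, s2, gap, as.integer(mismatch), as.integer(match))
  dimnames(res) <- list(c("-", s2), c("-", s1))
  res
}
```

With the compiled function available, `align_with_rcpp(a, b, 10, -5, -4)` should return the same scores as the pure R version in the question. Passing the fill routine as an argument also makes it easy to swap in a plain R implementation when no compiler is at hand.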