optimizationrvectorizationcheck-digit

Optimizing the Verhoeff Algorithm in R


I have written the following function to calculate a check digit in R.

verhoeffCheck <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)

## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}

#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)

digs <- rev(digs)   ## right to left algorithm

## tables required for D_5 group

d5_mult <- matrix(c(
                 0:9,
                 c(1:4,0,6:9,5),
                 c(2:4,0:1,7:9,5:6),
                 c(3:4,0:2,8:9,5:7),
                 c(4,0:3,9,5:8),
                 c(5,9:6,0,4:1),
                 c(6:5,9:7,1:0,4:2),
                 c(7:5,9:8,2:0,4:3),
                 c(8:5,9,3:0,4),
                 9:0
                 ),10,10,byrow=T)

d5_perm <- matrix(c(
                 0:9,
                 c(1,5,7,6,2,8,3,0,9,4),
                 c(5,8,0,3,7,9,6,1,4,2),
                 c(8,9,1,6,0,4,3,5,2,7),
                 c(9,4,5,3,1,2,6,8,7,0),
                 c(4,2,8,6,5,7,3,9,0,1),
                 c(2,7,9,3,8,0,6,4,1,5),
                 c(7,0,4,6,9,1,3,2,5,8)
                 ),8,10,byrow=T)

d5_inv <- c(0,4:1,5:9)

## apply algoritm - note 1-based indexing in R
d <- 0

for (i in 1:length(digs)){
    d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
    }

d5_inv[d+1]
}

In order to run on a vector of strings, sapply must be used. This is in part because of the use of strsplit, which returns a list of vectors. This does impact on the performance even for only moderately sized inputs.

How could this function be vectorized?

I am also aware that some performance is lost in having to create the tables in each iteration. Would storing these in a new environment be a better solution?


Solution

  • If your input strings can contain different numbers of characters, then I don't see any way round lapply calls (or a plyr equivalent). The trick is to move them inside the function, so verhoeffCheck can accept vector inputs. This way you only need to create the matrices once.

    verhoeffCheckNew <- function(x)
    {
    ## calculates check digit based on Verhoeff algorithm
    
    ## check for string since leading zeros with numbers will be lost
      if (!is.character(x)) stop("Must enter a string")
    
      #split and convert to numbers
      digs <- strsplit(x, "")
      digs <- lapply(digs, function(x) rev(as.numeric(x)))
    
      ## tables required for D_5 group
      d5_mult <- matrix(c(
                       0:9,
                       c(1:4,0,6:9,5),
                       c(2:4,0:1,7:9,5:6),
                       c(3:4,0:2,8:9,5:7),
                       c(4,0:3,9,5:8),
                       c(5,9:6,0,4:1),
                       c(6:5,9:7,1:0,4:2),
                       c(7:5,9:8,2:0,4:3),
                       c(8:5,9,3:0,4),
                       9:0
                       ),10,10,byrow=T)
    
      d5_perm <- matrix(c(
                       0:9,
                       c(1,5,7,6,2,8,3,0,9,4),
                       c(5,8,0,3,7,9,6,1,4,2),
                       c(8,9,1,6,0,4,3,5,2,7),
                       c(9,4,5,3,1,2,6,8,7,0),
                       c(4,2,8,6,5,7,3,9,0,1),
                       c(2,7,9,3,8,0,6,4,1,5),
                       c(7,0,4,6,9,1,3,2,5,8)
                       ),8,10,byrow=T)
    
      d5_inv <- c(0,4:1,5:9)
    
      ## apply algorithm - note 1-based indexing in R      
      sapply(digs, function(x)
      {
        d <- 0  
        for (i in 1:length(x)){
            d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, x[i] + 1]) + 1]
            }  
        d5_inv[d+1]
      })
    }
    

    Since d depends on what it was previously, the is no easy way to vectorise the for loop.

    My version runs in about half the time for 1e5 strings.

    rand_string <- function(n = 12) 
    {
      paste(sample(as.character(0:9), sample(n), replace = TRUE), collapse = "")
    }
    big_test <- replicate(1e5, rand_string())
    
    tic()
    res1 <- unname(sapply(big_test, verhoeffCheck))
    toc()
    
    tic()
    res2 <- verhoeffCheckNew(big_test)
    toc()
    
    identical(res1, res2) #hopefully TRUE!
    

    See this question for tic and toc.

    Further thoughts:

    You may want additional input checking for "" and other strings that return NA when converted in numeric.

    Since you are dealing exclusively with integers, you may get a slight performance benefit from using them rather than doubles. (Use as.integer rather than as.numeric and append L to the values in your matrices.)