rperformancechecksumcheck-digit

Efficient implementation of the GTIN-13 algorithm


I am looking for an efficient way to implement the GTIN-13 check digit algorithm. I have looked at some relevant SO posts such as this and this but it seems like efficiency was not the subject of attention in either case.

Briefly, the algorithm takes a numeric string (such as 123765) and multiples every other digit (from right to left) by 1 or 3 to calculate a sum (so 5 * 1 + 6 * 3 + 7 * 1 + 3 * 3 + 2 * 1 + 1 * 3 = 44) and then subtracts this sum from the closest multiple of 10 that is equal to or greater to this sum (in this case 50 - 44 = 6) to derive the final check digit (here, 6). The input is expected to be 12 digits long, but if shorter, it can be simply padded with zeros from the left (so 123765 is really expected as 000000123765) but the result will be still the same.

A naive implementation of this would be as follows:

gtin13 <- function(n) {
  s <- as.character(n)
  check.sum <- 0
  for (i in 1:nchar(s)) {
    digit <- substr(s, nchar(s) - i + 1, nchar(s) - i + 1)
    check.sum <- check.sum + as.numeric(digit) * ifelse(i %% 2, 1, 3)
  }
  10 - check.sum %% 10
}

However, this is inefficient because of the for loop as well as the conversion to a string and back to a number. For instance:

df <- data.frame(
  num <- sample(1:1000000, 100000, T)
)
system.time(cd <- vapply(df$num, gtin13, 0))

Take about 6 seconds on an average desktop.

What is a more efficient to calculate this check.sum?


Solution

  • This version doesn't need the vapply so it's faster because we don't loop over the number of possible digits in R. For example

    gtim13_vec <- function(x) {
      d <- x %% 10
      for(i in 1:12) { # Input can be up to 12 digits
        d <- d +(x%/% 10^i %% 10) * c(1,3)[1+i%%2]
      }
      d
      10-(d%%10)
    }
    

    I used set.seed(7) for this experiment. I see

    system.time(r1 <- vapply(df$num, gtim13, 0))
    #    user  system elapsed 
    #    3.21    0.00    3.36 
    system.time(r2 <- gtim13_vec(df$num))
    #    user  system elapsed 
    #    0.03    0.00    0.03 
    all(r1==r2)
    # [1] TRUE
    

    So there's a big speed improvement.