
Calculation of Cliff's delta with very large groups causes integer overflow


In R, I need to calculate Cliff's delta. Here is the formula:

δ = ( Σi Σj ( [xi > xj] - [xi < xj] ) ) / (n * m)

Where xi is an observation in group A (n observations), xj is an observation in group B (m observations), and [xi > xj] is 1 if xi > xj is true and 0 otherwise.

Here is Cliff's (1993) non-mathy explanation:

...each of the n xs in one group is compared with each of the m in the other, and counts are made of how many times the member of the first group is higher and how many times it is lower.

This is then divided by the number of pairwise comparisons, i.e., the product of the number of observations in each group.
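
For small groups the definition can be computed directly. Here is a toy sketch (toy numbers, not my real data) that counts wins minus losses over all n*m pairs:

xa <- c(3, 5, 8)       # group A, n = 3
xb <- c(1, 5, 6, 9)    # group B, m = 4

> (sum(outer(xa, xb, ">")) - sum(outer(xa, xb, "<"))) / (length(xa) * length(xb))
[1] -0.08333333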

Here comes the problem: in my data, group A has 66208 observations and group B has 228691 observations. Their product, 66208 * 228691 = 15141173728, exceeds R's largest integer (.Machine$integer.max, i.e. 2147483647), so multiplying the two lengths causes integer overflow:

set.seed(123)
a <- runif(228691)
b <- runif(66208)

x <- length(a) * length(b)  # length() returns integer; 32-bit integer arithmetic overflows

Warning message:
In length(a) * length(b) : NAs produced by integer overflow

Therefore, I'm not sure whether I can trust the result of effsize when calculating Cliff's delta, as it throws a similar warning:

library(effsize)
x <- cliff.delta(a, b)

Warning messages:
1: In n1 * n2 : NAs produced by integer overflow
2: In n1 * n2 : NAs produced by integer overflow

> x$estimate
[1] -0.0005022877

I can probably find a way to implement the calculation without using effsize, but I don't see any way around dividing something by 15141173728 (228691 * 66208), which causes integer overflow. In addition, the numerator might also overflow.

Is there any way to work with numbers this large so that I can calculate Cliff's delta on my data?

Edit:

I followed Pbulls' advice and found the part of the effsize::cliff.delta code where d is computed. effsize::cliff.delta returns a list of 10, and if the only thing needed is d, there won't be any warnings:

set.seed(123)
vector1 <- runif(228691)
vector2 <- runif(66208)

## Uncomment to replicate the subsample sanity check
#vector1 <- vector1[1:1000]
#vector2 <- vector2[1:1000]

.bsearch.partition <- function(x, a, b = 1, e = length(a)) {
  ## Vectorised binary search: for each element of x, locate the last index
  ## of the sorted vector a that is strictly below it (L) and the first
  ## index that is strictly above it (H)
  n <- length(x)
  low <- rep(NA, n)   # unused below
  L <- rep(b, n)
  H <- rep(e, n)
  
  ## Narrow the search window [L, H] around each x simultaneously
  repeat {
    M <- as.integer((L + H) / 2)
    left <- x <= a[M]
    H[left] <- M[left]
    L[!left] <- M[!left] + 1
    if (all(H <= L)) {
      break
    }
  }
  
  ## Move H upwards past any ties with x
  H <- L
  repeat {
    below <- a[H] == x
    below[is.na(below)] <- FALSE
    if (!any(below)) 
      break
    H[below] <- H[below] + 1
  }
  
  ## Move L downwards past any elements >= x
  repeat {
    L.clean <- L
    L.clean[L.clean < 1] <- NA
    above <- a[L.clean] >= x
    above[is.na(above)] <- FALSE
    if (!any(above)) 
      break
    L[above] <- L[above] - 1
  }
  
  if (any(L == H)){
    H[H == L] <- L[H == L] + 1
  }
  H[H > length(a) + 1] <- length(a) + 1
  cbind(below = L, above = H)
}

treatment <- sort(vector1)
control <- sort(vector2)

n1 <- length(treatment)
n2 <- length(control)

## For each treatment value: column 1 = number of control values strictly
## below it, column 2 (after the two fix-ups) = number strictly above it
partitions <- .bsearch.partition(treatment, control)
partitions[, 2] <- n2 - partitions[, 2] + 1L  # index of first value above -> count above
partitions[partitions[, 1] > n2, 1] <- n2     # cap the "below" count at n2

## Per-observation dominance (count below - count above) / n2; its mean is d,
## so the n1 * n2 product is never formed
d_i. <- partitions %*% c(1L, -1L) / n2
d <- mean(d_i.)

> d
[1] -0.0005022877
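
As a further sanity check that this extracted code matches the definition, it can be compared against the naive computation on vectors small enough for outer() (a sketch; t_small and c_small are just illustrative names):

set.seed(1)
t_small <- sort(runif(500))
c_small <- sort(runif(300))

p <- .bsearch.partition(t_small, c_small)
p[, 2] <- length(c_small) - p[, 2] + 1L
p[p[, 1] > length(c_small), 1] <- length(c_small)

## Partition-based delta and the direct definition; the two should agree
mean(p %*% c(1L, -1L) / length(c_small))
mean(outer(t_small, c_small, ">")) - mean(outer(t_small, c_small, "<"))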

Solution

  • Logarithms are your best friend when dealing with impossibly large multiplications.

    This implementation isn't efficient -- you might do well to copy a better pairwise comparison approach from effsize itself -- but it should be numerically fine up to log(n1) + log(n2) ~ 702 or until the additions overflow:

    log.cliff.delta <- function(x, y) {
       ## bigger <- sum(outer(x, y, `>`))   ## Needs 113 Gb memory
       bigger <- sum(vapply(y, \(yo) sum(x > yo), numeric(1)))
       smaller <- sum(vapply(y, \(yo) sum(x < yo), numeric(1)))
       sign <- c(1, -1)[1+(bigger < smaller)]
       log_d <- log(abs(bigger - smaller)) - log(length(x)) - log(length(y))
       sign*exp(log_d)
    }
    
    ## Subsample sanity check
    effsize::cliff.delta(a[1:1000], b[1:1000])
    #> delta estimate: -0.016404 (negligible)
    
    log.cliff.delta(a[1:1000], b[1:1000])
    #> -0.016404
    
    ## The whole nine yards
    effsize::cliff.delta(a, b)
    #> delta estimate: -0.0005022877 (negligible)
    #> Warning messages:
    #> 1: In n1 * n2 : NAs produced by integer overflow
    #> 2: In n1 * n2 : NAs produced by integer overflow
    
    log.cliff.delta(a, b)   ## is *not* fast
    #> -0.0005022877
    

    From this you can see that the warnings from effsize::cliff.delta weren't as concerning as you might think: it actually implements some clever matrix multiplication and averaging in calculating the delta estimate itself, so the n1 * n2 denominator never enters that computation (and it's much faster than my naïve attempt).
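
    Concretely (this is the same per-observation averaging your extracted code performs), the trick amounts to the identity

        δ = (1/n1) Σi ( #{j : xi > yj} - #{j : xi < yj} ) / n2

    where every intermediate count is bounded by n2, so the full n1 * n2 product never has to be formed.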

    I cannot speak for the confidence limits, however; those may be affected to an unknown extent. But you didn't provide a formula for them, and with these sample sizes any confidence interval is likely to be so narrow as to be quite meaningless anyway.

    Edit: Two additional points came to mind: