rkolmogorov-smirnov

KS-test between each columns of a matrix in R


I was wondering if there is a package or a new way that can perform KS-test between each columns of a matrix more quickly than use 2 loops? Thanks!

all <- c()
for (i in (1:(ncol(a)-1))){
  for (j in ((i+1):ncol(a))){
    res <- ks.test(i,j)
    all <- rbind(all,res)
  }
}

Solution

  • Suppose we have a 5 column matrix, each containing 10 draws from a different normal distribution:

    set.seed(1)
    
    a <- sapply(seq(1, 3, 0.5), function(x) rnorm(10, x))
    
    a
    #>            [,1]       [,2]      [,3]     [,4]     [,5]
    #>  [1,] 0.3735462  3.0117812 2.9189774 3.858680 2.835476
    #>  [2,] 1.1836433  1.8898432 2.7821363 2.397212 2.746638
    #>  [3,] 0.1643714  0.8787594 2.0745650 2.887672 3.696963
    #>  [4,] 2.5952808 -0.7146999 0.0106483 2.446195 3.556663
    #>  [5,] 1.3295078  2.6249309 2.6198257 1.122940 2.311244
    #>  [6,] 0.1795316  1.4550664 1.9438713 2.085005 2.292505
    #>  [7,] 1.4874291  1.4838097 1.8442045 2.105710 3.364582
    #>  [8,] 1.7383247  2.4438362 0.5292476 2.440687 3.768533
    #>  [9,] 1.5757814  2.3212212 1.5218499 3.600025 2.887654
    #> [10,] 0.6946116  2.0939013 2.4179416 3.263176 3.881108
    

    We can get all unique combinations of 2 columns using combn and apply ks.test to these pairs of columns, retreiving all the p values in a single vector like this:

    p <- apply(combn(ncol(a), 2), 2, function(x) ks.test(a[,x[1]], a[, x[2]])$p.val)
    
    p
    #>  [1] 0.1678213427 0.0524475524 0.0020567668 0.0002165018 0.9944575548
    #>  [6] 0.4175236528 0.0123406006 0.1678213427 0.0524475524 0.4175236528
    

    To make it clearer which comparison these p values refer to, we can store the results in a 5 x 5 matrix that will be the same as the result of your loop. Note though, that we have only had to run ks.test 10 times instead of 25 because the diagonal is already known to be p = 1, and the matrix is symmetrical:

    m <- matrix(0, ncol(a), ncol(a))
    
    m[lower.tri(m)] <- round(p, 3)
    m <- t(m)
    m[lower.tri(m)] <- round(p, 3)
    diag(m) <- 1
    
    m
    #>       [,1]  [,2]  [,3]  [,4]  [,5]
    #> [1,] 1.000 0.168 0.052 0.002 0.000
    #> [2,] 0.168 1.000 0.994 0.418 0.012
    #> [3,] 0.052 0.994 1.000 0.168 0.052
    #> [4,] 0.002 0.418 0.168 1.000 0.418
    #> [5,] 0.000 0.012 0.052 0.418 1.000
    

    Alternatively, you could give p the names of the columns compared:

    names(p) <- apply(combn(ncol(a), 2), 2, paste, collapse = "-")
    
    p
    #>          1-2          1-3          1-4          1-5          2-3 
    #> 0.1678213427 0.0524475524 0.0020567668 0.0002165018 0.9944575548 
    #>          2-4          2-5          3-4          3-5          4-5 
    #> 0.4175236528 0.0123406006 0.1678213427 0.0524475524 0.4175236528 
    

    Created on 2022-09-10 with reprex v2.0.2