Tags: python, r, statistics

Discrepancy between Python and R calculation of a robust covariance matrix


I am currently developing a statistical package in Python, using R code as a reference, and I've noticed that the two languages give different results when calculating a robust covariance matrix.

Using the following code with equivalent input data (x) in Python:

import numpy as np
from sklearn.covariance import MinCovDet

# dummy data
x = np.array([
    [0.5488135, 0.71518937, 0.60276338, 0.54488318],
    [0.4236548, 0.64589411, 0.43758721, 0.891773],
    [0.96366276, 0.38344152, 0.79172504, 0.52889492],
    [0.56804456, 0.92559664, 0.07103606, 0.0871293],
    [0.0202184, 0.83261985, 0.77815675, 0.87001215],
    [0.97861834, 0.79915856, 0.46147936, 0.78052918],
    [0.11827443, 0.63992102, 0.14335329, 0.94466892],
    [0.52184832, 0.41466194, 0.26455561, 0.77423369],
    [0.45615033, 0.56843395, 0.0187898, 0.6176355],
    [0.61209572, 0.616934, 0.94374808, 0.6818203],
    [0.3595079, 0.43703195, 0.6976312, 0.06022547],
    [0.66676672, 0.67063787, 0.21038256, 0.1289263],
    [0.31542835, 0.36371077, 0.57019677, 0.43860151],
    [0.98837384, 0.10204481, 0.20887676, 0.16130952],
    [0.65310833, 0.2532916, 0.46631077, 0.24442559],
    [0.15896958, 0.11037514, 0.65632959, 0.13818295],
    [0.19658236, 0.36872517, 0.82099323, 0.09710128],
    [0.83794491, 0.09609841, 0.97645947, 0.4686512],
    [0.97676109, 0.60484552, 0.73926358, 0.03918779],
    [0.28280696, 0.12019656, 0.2961402, 0.11872772]
])

# fit mcd
mcd = MinCovDet().fit(x)

# define robust covariance variable
x_cov = mcd.covariance_

and in R:

library(robustbase)

# dummy data
x <- matrix(c(
  0.5488135, 0.71518937, 0.60276338, 0.54488318,
  0.4236548, 0.64589411, 0.43758721, 0.891773,
  0.96366276, 0.38344152, 0.79172504, 0.52889492,
  0.56804456, 0.92559664, 0.07103606, 0.0871293,
  0.0202184, 0.83261985, 0.77815675, 0.87001215,
  0.97861834, 0.79915856, 0.46147936, 0.78052918,
  0.11827443, 0.63992102, 0.14335329, 0.94466892,
  0.52184832, 0.41466194, 0.26455561, 0.77423369,
  0.45615033, 0.56843395, 0.0187898, 0.6176355,
  0.61209572, 0.616934, 0.94374808, 0.6818203,
  0.3595079, 0.43703195, 0.6976312, 0.06022547,
  0.66676672, 0.67063787, 0.21038256, 0.1289263,
  0.31542835, 0.36371077, 0.57019677, 0.43860151,
  0.98837384, 0.10204481, 0.20887676, 0.16130952,
  0.65310833, 0.2532916, 0.46631077, 0.24442559,
  0.15896958, 0.11037514, 0.65632959, 0.13818295,
  0.19658236, 0.36872517, 0.82099323, 0.09710128,
  0.83794491, 0.09609841, 0.97645947, 0.4686512,
  0.97676109, 0.60484552, 0.73926358, 0.03918779,
  0.28280696, 0.12019656, 0.2961402, 0.11872772
), nrow = 20, ncol = 4, byrow = TRUE)

# fit mcd
x.mcd <- covMcd(x)

# define robust covariance variable
x_cov <- x.mcd$cov

I get very different values for x_cov. I am aware that there is stochasticity involved in each of these algorithms, but the differences are way too large to be attributed to that.

For example, in Python:

x_cov = array([[ 0.06669275,  0.01987514,  0.01294049,  0.0235569 ],
               [ 0.01987514,  0.0421388 , -0.00541365,  0.0462657 ],
               [ 0.01294049, -0.00541365,  0.06601437, -0.02931285],
               [ 0.0235569 ,  0.0462657 , -0.02931285,  0.08961389]])

and in R:

x_cov =
            [,1]       [,2]        [,3]        [,4]
[1,]  0.15762177 0.01044705 -0.04043184 -0.01187968
[2,]  0.01044705 0.09957141  0.03036312  0.08703770
[3,] -0.04043184 0.03036312  0.06045952 -0.01781794
[4,] -0.01187968 0.08703770 -0.01781794  0.16634435

Perhaps I'm missing something here, but I can't seem to figure out the discrepancy. Have you run into something similar or do you perhaps have any suggestions for me? Thanks!


Solution

  • The two implementations are not identical, so simply passing your data with the default arguments is unlikely to produce identical results. The R function (robustbase::covMcd) documents its arguments and default values extensively, but the Python class (sklearn.covariance.MinCovDet) provides much less detail, so you may have to consult its source code to be absolutely sure of the cause. However, the parameters affecting estimation include at least the following, depending on the R algorithm employed and your data (e.g. whether the maximum number of iterations is reached while solving):

    Python (sklearn's MinCovDet):

      • support_fraction: proportion of observations used for the raw estimate
      • assume_centered: whether the location is estimated or fixed at zero
      • random_state: seed controlling the random subset draws

    R (robustbase's covMcd):

      • alpha: roughly the proportion of observations the estimate is based on
      • nsamp: number of initial subsets, or a method ("best", "exact", "deterministic")
      • use.correction: whether finite-sample correction factors are applied
      • seed: seed controlling the random subset draws
    As you can see, these do not map cleanly onto each other, which is especially problematic because the Python documentation omits the equations needed to compare the methods directly, e.g. for determining the subset size (at least without reading multiple articles).
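    One way to reduce the unmapped parameters is to fix the two that sklearn does expose. The sketch below pins down support_fraction and random_state so repeated Python runs are at least reproducible; the chosen subset size mirrors R's default alpha = 1/2, under which covMcd uses roughly h = (n + p + 1) / 2 observations. Treat that mapping as an assumption to verify against your covMcd output, not a documented equivalence.

    ```python
    import numpy as np
    from sklearn.covariance import MinCovDet

    # stand-in for the dummy data above
    rng = np.random.RandomState(0)
    x = rng.rand(20, 4)

    n, p = x.shape
    h = (n + p + 1) // 2  # subset size analogous to R's default alpha = 1/2

    # fix the subset fraction and the RNG so runs are reproducible
    mcd = MinCovDet(support_fraction=h / n, random_state=0).fit(x)

    print(mcd.support_.sum())       # observations in the final support
    print(mcd.covariance_.shape)    # robust covariance, p x p
    ```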

    What I'd recommend is pinpointing where the divergence arises. You said that it's not the centering (although I'd check this manually with pre-centered data). With MCD there appear to be a few possible points of divergence:

    1. Sample selection: the method for determining which observations are used

    2. Weighting: how those observations are weighted

    3. Corrections: finite sample and consistency correction method

    4. Algorithms: instructions passed to the solver and the method used to solve the problem (e.g. "Fast MCD" vs. "DetMCD" in R)

    The easiest place to start is checking which observations were used. This is straightforward in R (x.mcd$best) and it is also supported in Python. If different observations were used, then you need to look at the number and size of the subsets examined for differences. If the same observations were used, then I'd look at the raw estimates before reweighting and correction. Python exposes these (as raw_location_ and raw_covariance_), and you can extract them directly in R (as raw.center and raw.cov). The R values for these may be post-correction (I doubt it), in which case you can use raw.cnp2 to "uncorrect" the estimates. If the raw estimates match, then the difference has to come from the weighting or the corrections.
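    The diagnostic steps above look roughly like this on the Python side (the attribute names are sklearn's; the corresponding R components are x.mcd$best, x.mcd$raw.center, x.mcd$raw.cov and x.mcd$raw.cnp2). Note that R's indices are 1-based while numpy's are 0-based, so add 1 before comparing against x.mcd$best.

    ```python
    import numpy as np
    from sklearn.covariance import MinCovDet

    # stand-in for the dummy data above
    rng = np.random.RandomState(0)
    x = rng.rand(20, 4)

    mcd = MinCovDet(random_state=0).fit(x)

    # 1. Sample selection: which observations are in the support
    support_idx = np.flatnonzero(mcd.support_)
    print("observations used (0-based):", support_idx)

    # 2. Raw estimates, before reweighting and correction
    print("raw location:", mcd.raw_location_)
    print("raw covariance:\n", mcd.raw_covariance_)

    # 3. Final (reweighted) estimates, for comparison with x.mcd$cov
    print("covariance:\n", mcd.covariance_)
    ```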