I am currently developing a statistical package in Python using R code as a reference, and I've noticed different results between the two programs when calculating a robust covariance matrix in Python versus R.
Using the following code with equivalent input data (x) in Python:
import numpy as np
from sklearn.covariance import MinCovDet

# dummy data
x = np.array([
[0.5488135, 0.71518937, 0.60276338, 0.54488318],
[0.4236548, 0.64589411, 0.43758721, 0.891773],
[0.96366276, 0.38344152, 0.79172504, 0.52889492],
[0.56804456, 0.92559664, 0.07103606, 0.0871293],
[0.0202184, 0.83261985, 0.77815675, 0.87001215],
[0.97861834, 0.79915856, 0.46147936, 0.78052918],
[0.11827443, 0.63992102, 0.14335329, 0.94466892],
[0.52184832, 0.41466194, 0.26455561, 0.77423369],
[0.45615033, 0.56843395, 0.0187898, 0.6176355],
[0.61209572, 0.616934, 0.94374808, 0.6818203],
[0.3595079, 0.43703195, 0.6976312, 0.06022547],
[0.66676672, 0.67063787, 0.21038256, 0.1289263],
[0.31542835, 0.36371077, 0.57019677, 0.43860151],
[0.98837384, 0.10204481, 0.20887676, 0.16130952],
[0.65310833, 0.2532916, 0.46631077, 0.24442559],
[0.15896958, 0.11037514, 0.65632959, 0.13818295],
[0.19658236, 0.36872517, 0.82099323, 0.09710128],
[0.83794491, 0.09609841, 0.97645947, 0.4686512],
[0.97676109, 0.60484552, 0.73926358, 0.03918779],
[0.28280696, 0.12019656, 0.2961402, 0.11872772]
])
# fit mcd
mcd = MinCovDet().fit(x)
# define robust covariance variable
x_cov = mcd.covariance_
and in R:
library(robustbase)

# dummy data
x <- matrix(c(
0.5488135, 0.71518937, 0.60276338, 0.54488318,
0.4236548, 0.64589411, 0.43758721, 0.891773,
0.96366276, 0.38344152, 0.79172504, 0.52889492,
0.56804456, 0.92559664, 0.07103606, 0.0871293,
0.0202184, 0.83261985, 0.77815675, 0.87001215,
0.97861834, 0.79915856, 0.46147936, 0.78052918,
0.11827443, 0.63992102, 0.14335329, 0.94466892,
0.52184832, 0.41466194, 0.26455561, 0.77423369,
0.45615033, 0.56843395, 0.0187898, 0.6176355,
0.61209572, 0.616934, 0.94374808, 0.6818203,
0.3595079, 0.43703195, 0.6976312, 0.06022547,
0.66676672, 0.67063787, 0.21038256, 0.1289263,
0.31542835, 0.36371077, 0.57019677, 0.43860151,
0.98837384, 0.10204481, 0.20887676, 0.16130952,
0.65310833, 0.2532916, 0.46631077, 0.24442559,
0.15896958, 0.11037514, 0.65632959, 0.13818295,
0.19658236, 0.36872517, 0.82099323, 0.09710128,
0.83794491, 0.09609841, 0.97645947, 0.4686512,
0.97676109, 0.60484552, 0.73926358, 0.03918779,
0.28280696, 0.12019656, 0.2961402, 0.11872772
), nrow = 20, ncol = 4, byrow = TRUE)
# fit mcd
x.mcd <- covMcd(x)
# define robust covariance variable
x_cov <- x.mcd$cov
I get very different values for x_cov. I am aware that there is some stochasticity involved in each of these algorithms, but the differences are far too large to be attributed to that.
For example, in Python:
x_cov = array([[ 0.06669275, 0.01987514, 0.01294049, 0.0235569 ],
[ 0.01987514, 0.0421388 , -0.00541365, 0.0462657 ],
[ 0.01294049, -0.00541365, 0.06601437, -0.02931285],
[ 0.0235569 , 0.0462657 , -0.02931285, 0.08961389]])
and in R:
x_cov = [,1] [,2] [,3] [,4]
[1,] 0.15762177 0.01044705 -0.04043184 -0.01187968
[2,] 0.01044705 0.09957141 0.03036312 0.08703770
[3,] -0.04043184 0.03036312 0.06045952 -0.01781794
[4,] -0.01187968 0.08703770 -0.01781794 0.16634435
Perhaps I'm missing something here, but I can't seem to figure out the discrepancy. Have you run into something similar or do you perhaps have any suggestions for me? Thanks!
The two implementations are not identical, so simply passing your data with the default arguments is unlikely to produce identical results. The R function (covMcd from robustbase) documents its arguments and default values extensively, but the Python class (scikit-learn's MinCovDet) provides much less detail, so you may have to consult its source code to be absolutely sure of the cause. Depending on the algorithm employed and on your data (e.g. whether the maximum number of steps is reached while solving), the parameters affecting estimation include at least the following:
Python (MinCovDet): support_fraction, assume_centered, random_state
R (covMcd): alpha, nsamp, seed, use.correction
As you can see, these do not map cleanly onto each other, which is especially problematic because the Python documentation omits the equations needed to compare the methods directly (e.g. for determining the subset size), at least without reading multiple articles.
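As a sketch of the kind of alignment I mean: support_fraction in Python and alpha in R both control the fraction of observations in the raw MCD subset, so pinning both (the 0.75 below is purely illustrative) removes one source of divergence. Fixing random_state in Python and seed in R tames the subsampling stochasticity, though it won't make the two programs draw the same subsets.

```python
import numpy as np
from sklearn.covariance import MinCovDet

# illustrative data; substitute your own x
rng = np.random.default_rng(0)
x = rng.random((20, 4))

# support_fraction plays the role of R's alpha: the fraction of
# observations used for the raw MCD estimate. random_state pins
# the random subsampling on the Python side.
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(x)
print(mcd.covariance_)
```

The rough R counterpart would be covMcd(x, alpha = 0.75, seed = ...), though the two random number generators differ, so the subsets drawn will still not coincide.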
What I'd recommend is pinpointing where the divergence arises. You said that it's not the centering (although I'd verify this manually with pre-centered data). With MCD there appear to be a few possible points of divergence:
Sample selection: the method for determining which observations are used
Weighting: how those observations are weighted
Corrections: finite sample and consistency correction method
Algorithms: the instructions passed to the solver and the method used to solve the problem (e.g. FastMCD vs. DetMCD in R)
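For the manual centering check, one sketch on the Python side is to subtract a robust center yourself and then pass assume_centered=True (a real MinCovDet flag that skips location estimation); if the discrepancy survives with pre-centered data in both programs, centering is ruled out:

```python
import numpy as np
from sklearn.covariance import MinCovDet

# illustrative data; substitute your own x
rng = np.random.default_rng(0)
x = rng.random((20, 4))

# Center the data with a robust location (column medians here),
# then tell the estimator the data are already centered so that
# location estimation is skipped.
centered = x - np.median(x, axis=0)
mcd = MinCovDet(assume_centered=True, random_state=0).fit(centered)
print(mcd.covariance_)
```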
The easiest starting place is to check which observations were used. This is reported as best in R, and it looks like it is exposed as support_ in Python. If different observations were used, then look at the number and size of the subsets examined for differences. If the same observations were used, then I'd compare the raw estimates before re-weighting and correction. Python exposes these as raw_location_ and raw_covariance_, and you can extract them directly in R as raw.center and raw.cov. The R values for these may already be post-correction (I doubt it), in which case you can use raw.cnp2 to "uncorrect" the estimates. If the raw estimates match, then the difference has to come from the weighting or the corrections.