I implemented my own correlation function in R. Surprisingly I get slightly different results when using the built-in cor
function. The differences seem to disappear when n
the number of observations are big enough.
My function:
corr = function(X) {
Q = X - colMeans(X)
S_ = colSums(Q**2)
S = sqrt(S_ %*% t(S_))
covarr = t(Q) %*% Q
corrr_ = covarr / S
return(corrr_)
}
library(mvtnorm)
set.seed(247)
X = rmvnorm(10, sigma = matrix(c(1,0.8,0.8,1), ncol=2)) # change 10 to 100, 1000, or 10000
corr(X)
cor(X)
For n=10
I get 0.8490966 vs. 0.8465363, so the change is in the 3rd decimal. For n=1000
I get 0.7960206 vs. 0.7960925, so the change is in the 5th decimal.
The first line of the function should be this since R stores matrices column by column and not row by row
Q = t(t(X) - colMeans(X))
or
Q = X - matrix(colMeans(X), nrow(X), ncol(X), byrow = TRUE)
or
Q = scale(X, TRUE, FALSE)
or even this which is not the same Q but in the end gives the same answer
Q = scale(X)
If we use cov2cor from the base of R then
corr = function(X) {
Q = scale(X)
cov2cor(crossprod(Q))
}