I keep getting NAs when trying to find the covariance matrix for the Iris data in R.
library(ggplot2)
library(dplyr)
dim(iris)
head(iris)
numIris <- iris %>%
  select_if(is.numeric)
plot(numIris[1:100,])
Xraw <- numIris[1:1000,]
plot(iris[1:150,-c(5)]) #species name is the 5th column; excluding it here.
Xraw = iris[1:1000,-c(5)] # this excludes the 5th column, which is the species column
#first, to get covariance, we need to subtract the mean from each column
X = scale(Xraw, scale = FALSE)
head(X)
Xs <- scale(Xraw, scale = TRUE)
head(Xs)
covMat = (t(X)%*%X)/ (nrow(X)-1)
head(covMat)
Is there a reason you can't use cov(numIris)?
By trying to select 1000 rows of a matrix/data frame with only 150 rows, you end up with 850 rows full of NA values (try tail(Xraw) to see). Those NAs carry through scale() and the matrix product, so every entry of covMat comes out as NA. If you set Xraw <- iris[, -5] and go from there, you get results such that all.equal(covMat, cov(iris[, -5])) is TRUE.
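For what it's worth, here is a minimal sketch of the corrected version, assuming only base R and the built-in iris data. The name padded is mine, used only to illustrate where the NAs come from; everything else follows the code in the question.

```r
# Keep all 150 rows and drop the 5th column (Species)
Xraw <- iris[, -5]

# Center each column; scale() returns a numeric matrix
X <- scale(Xraw, center = TRUE, scale = FALSE)

# Sample covariance: t(X) %*% X / (n - 1)
covMat <- (t(X) %*% X) / (nrow(X) - 1)

# Compare against the built-in cov()
all.equal(covMat, cov(Xraw))

# Why the original failed: indexing rows 1:1000 of a 150-row
# data frame pads the result with rows that are entirely NA
padded <- iris[1:1000, -5]   # hypothetical name, for illustration only
nrow(padded)                 # 1000
sum(!complete.cases(padded)) # 850 padded rows
```

If you would rather not hard-code the row count at all, seq_len(nrow(iris)) or simply leaving the row index empty avoids the problem.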