I am working with a relatively large data set (I am only using about 1/32 of it, but even this subset is approx. 50000x9000). In order to analyse it, I have taken several steps to reduce the dimensionality, so that I can then apply some sort of clustering algorithm.
Take a look at the following data frame:
set.seed(340)
df = data.frame(replicate(10,sample(0:10,size = 10,replace = TRUE)))
> df
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1   4  9  4  6  9  4  2  5  8   8
2   5  8  2  0  4  6  1  1  0  10
3   1  7  6  3  5  9  6  0  7   1
4   0  6  8  6  6  0  5  5 10  10
5   2  0  5  8  2 10  8  2  1   5
6   3  9 10  2  8  5  2 10  3  10
7   9  0  1  0  6  8  9  6  5   0
8   5  6  9  3 10  4  4  8  6   9
9   8  7  6  2 10  9  9  7  1  10
10  0  7  2  6  1  6  3  2  3   9
Each row represents a person, and each variable says how often that person exhibited that quality. Say I perform principal component analysis on this using princomp(), and collect the first four principal components to use for k-means.
pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)
From this I can deduce which cluster exhibits high values of which principal components, and I can use the loadings to see what each principal component is a general measure of. However, I would ultimately like to connect this information to my original data set. Is there a way to assign each person in the original data to a cluster created by the k-means on the principal components? Or am I misunderstanding the concept of PCA?
pc$loadings holds the coordinates of the input variables, not those of the individuals. So kmeans(new_df,2) classifies variables and not individuals. To check this, try your code with a 10x5 data.frame instead of a 10x10 one: you only get 5 cluster assignments, one per variable:
df = data.frame(replicate(5,sample(0:10,size = 10,replace = TRUE)))
pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)
fit$cluster
X1 X2 X3 X4 X5
2 2 1 2 2
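To see how the loadings relate to the people, here is a minimal sketch (the names centered and coords are mine, not from the question). Each person's coordinates on the components come from multiplying the centered data by the loadings, which gives one row per person, whereas the loadings matrix itself has one row per variable.

```r
set.seed(340)
df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))
pc <- princomp(df)

# Loadings: one row per variable (5 x 5 here)
dim(pc$loadings)

# Project the people onto the components: centered data times loadings
centered <- scale(df, center = TRUE, scale = FALSE)
coords <- centered %*% unclass(pc$loadings)

# Coordinates: one row per person (10 x 5 here)
dim(coords)
```

So kmeans() on the loadings can only ever group the five variables.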
If clustering the variables is what you want to do, then you can just rbind fit$cluster to your original data.frame and you will have the cluster of each variable.
df <- rbind(df,fit$cluster)
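Note that rbind appends the labels as an extra row, so df then has 11 rows and the last one is not a person. If you would rather leave the data untouched, one possible alternative (var_clusters is a name I made up for illustration) is to keep the labels in a small lookup table:

```r
set.seed(340)
df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))
pc <- princomp(df)
fit <- kmeans(pc$loadings[, 1:4], centers = 2)

# One row per variable; the 10 data rows in df stay as they are
var_clusters <- data.frame(variable = names(fit$cluster),
                           cluster  = unname(fit$cluster))
var_clusters
```

Which form is more convenient depends on what you do with the labels afterwards.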
However, if you intended to cluster individuals, i.e. rows of your original data.frame, you need to perform the clustering on the row coordinates produced by the principal component analysis. princomp stores those in pc$scores, and other PCA implementations expose them just as easily. FactoMineR::PCA outputs a list with row coordinates ($ind$coord) and column coordinates ($var$coord).
library(FactoMineR)
pf <- PCA(df,graph=FALSE)
fit <- kmeans(pf$ind$coord[,1:4],2)
fit$cluster
1 2 3 4 5 6 7 8 9 10
1 2 1 1 1 2 1 1 1 2
To add those to your original data.frame:
df$cluster <- fit$cluster
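The same idea also works without an extra package, because princomp() returns the row coordinates as its scores component. The following is only a sketch on a 10x5 toy data set like the one above; clustered and cluster_means are names I introduced for illustration. It clusters the people on the first four components and then summarises the original variables per cluster, which is one way to tie the clusters back to the original data.

```r
set.seed(340)
df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))

pc <- princomp(df)                             # covariance-based PCA
fit <- kmeans(pc$scores[, 1:4], centers = 2)   # scores: one row per person

# Attach one cluster label to each person
clustered <- cbind(df, cluster = fit$cluster)

# Mean of every original variable within each cluster
cluster_means <- aggregate(. ~ cluster, data = clustered, FUN = mean)
cluster_means
```

Keep in mind that FactoMineR::PCA standardises the variables by default (scale.unit = TRUE), while princomp works on the covariance matrix unless you pass cor = TRUE, so the two approaches will not necessarily give the same clusters.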