This is one of the biplots that I am working on. The circles mark clusters that I want to turn into subset data frames.
If I'm interested in the top cluster, how do I select the data that lies within the rectangle -.1 < PC1 < .1 & .8 < PC2 < 1.6?
I can't share my data, but we can practice using the iris set.
# iris ships with base R (package datasets), so no extra package is needed
biplot(prcomp(iris[, 1:4]))
Suppose I'm interested in the data in the rectangle -.125 < PC1 < .75 & -.15 < PC2 < 1.0.
How do I identify that data and create a subset out of it?
You can access the projected points (the scores) through the $x element of the prcomp result:
pc_res <- prcomp(iris[,1:4])
str(pc_res) # find that the data is stored in .$x
#> List of 5
#> $ sdev : num [1:4] 2.056 0.493 0.28 0.154
#> $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#> .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
#> $ center : Named num [1:4] 5.84 3.06 3.76 1.2
#> ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#> $ scale : logi FALSE
#> $ x : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
#> - attr(*, "class")= chr "prcomp"
dframe <- as.data.frame(pc_res$x)
# keep the rows whose scores fall inside the rectangle
sub_res <- subset(dframe, PC1 > -.125 & PC1 < .75 &
                          PC2 > -.15 & PC2 < 1.0)
head(sub_res)
#> PC1 PC2 PC3 PC4
#> 54 0.183317720 0.8279590 0.17959139 0.093566840
#> 56 0.641669084 0.4182469 -0.04107609 -0.243116767
#> 60 -0.008745404 0.7230819 -0.28114143 -0.005618918
#> 62 0.511698557 0.1039812 -0.13054775 0.050719232
#> 63 0.264976508 0.5500365 0.69414683 0.057185519
#> 67 0.660283762 0.3529697 -0.32802753 -0.187878621
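One caveat worth checking before reading limits off the plot: as far as I can tell from the source of biplot.prcomp(), the points drawn by biplot() are not the raw scores in $x. With the default scale = 1, each score column is divided by sdev * sqrt(n) first, so the biplot axes stay within a fraction of a unit while $x spans several units. The sketch below rebuilds those rescaled coordinates and uses the resulting logical mask to pull the matching rows out of the original data. The names bp, lam, in_box and iris_sub, as well as the rectangle limits, are made up for illustration.

```r
pc_res <- prcomp(iris[, 1:4])

# Rescale the scores the way biplot.prcomp() does for its default scale = 1:
# each column is divided by sdev * sqrt(n).
n   <- nrow(pc_res$x)
lam <- pc_res$sdev[1:2] * sqrt(n)
bp  <- as.data.frame(sweep(pc_res$x[, 1:2], 2, lam, "/"))

# Hypothetical rectangle read off the biplot axes (illustrative numbers only)
in_box <- bp$PC1 > -0.05 & bp$PC1 < 0.05 &
          bp$PC2 > -0.05 & bp$PC2 < 0.15

# The mask has one entry per input row, so it can index iris directly
iris_sub <- iris[in_box, ]
head(iris_sub)
```

If your limits come from a plain scatter plot of $x rather than from biplot(), skip the rescaling and build the mask on the raw scores instead.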
EDIT: To find the clusters, I would use a clustering algorithm (kmeans here):
# cluster on the projection onto (PC1, PC2)
# kmeans() starts from random centers, so the cluster labels can change between runs
dframe <- as.data.frame(prcomp(iris[,1:4])$x)
classif <- kmeans(x = dframe[,1:2], centers = 3, iter.max = 100, nstart = 10)
classif
#> K-means clustering with 3 clusters of sizes 61, 39, 50
#>
#> Cluster means:
#> PC1 PC2
#> 1 0.665676 0.3316042
#> 2 2.346527 -0.2739386
#> 3 -2.642415 -0.1908850
#>
#> Clustering vector:
#> [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
#> [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
#> [141] 2 2 1 2 2 2 1 2 2 1
#>
#> Within cluster sum of squares by cluster:
#> [1] 31.87959 18.87111 13.06924
#> (between_SS / total_SS = 90.4 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss"
#> [5] "tot.withinss" "betweenss" "size" "iter"
#> [9] "ifault"
# inspect the structure of the result
str(classif)
#> List of 9
#> $ cluster : int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
#> $ centers : num [1:3, 1:2] 0.666 2.347 -2.642 0.332 -0.274 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:3] "1" "2" "3"
#> .. ..$ : chr [1:2] "PC1" "PC2"
#> $ totss : num 666
#> $ withinss : num [1:3] 31.9 18.9 13.1
#> $ tot.withinss: num 63.8
#> $ betweenss : num 602
#> $ size : int [1:3] 61 39 50
#> $ iter : int 2
#> $ ifault : int 0
#> - attr(*, "class")= chr "kmeans"
classif$centers
#> PC1 PC2
#> 1 0.665676 0.3316042
#> 2 2.346527 -0.2739386
#> 3 -2.642415 -0.1908850
dframe$group <- classif$cluster
# check your groups visually; here you want the group whose center is closest to the origin (group 1 in this run)
plot(x = dframe$PC1, y = dframe$PC2, col = dframe$group)
result <- dframe[dframe$group == 1,] # or subset(x = dframe, subset = dframe$group == 1)
head(result)
#> PC1 PC2 PC3 PC4 group
#> 52 0.9324885 -0.31833364 0.01801419 0.0005665121 1
#> 54 0.1833177 0.82795901 0.17959139 0.0935668402 1
#> 55 1.0881033 -0.07459068 0.30775790 0.1120205742 1
#> 56 0.6416691 0.41824687 -0.04107609 -0.2431167665 1
#> 57 1.0950607 -0.28346827 -0.16981024 -0.0835565724 1
#> 58 -0.7491227 1.00489096 -0.01230292 -0.0179077226 1
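Because the labels 1, 2, 3 are assigned arbitrarily, hard-coding group == 1 may select a different cluster on another run. A minimal sketch of one way around that, assuming you can name a point that lies inside the cluster you care about: pick the cluster whose center is nearest to that point, then use the cluster vector to subset the original rows. The target coordinates, the seed, and the names scores, target, pick and iris_cluster are hypothetical.

```r
set.seed(42)  # arbitrary seed, only to make the run reproducible
scores  <- as.data.frame(prcomp(iris[, 1:4])$x)
classif <- kmeans(scores[, 1:2], centers = 3, iter.max = 100, nstart = 10)

# Hypothetical point of interest in (PC1, PC2), e.g. read off the scatter plot
target <- c(PC1 = 0, PC2 = 0.5)

# Squared distance from each cluster center to the target;
# rows of classif$centers are ordered by cluster label
d2   <- rowSums(sweep(classif$centers, 2, target)^2)
pick <- unname(which.min(d2))

# classif$cluster has one label per input row, so it indexes iris directly
iris_cluster <- iris[classif$cluster == pick, ]
table(iris_cluster$Species)
```

The table at the end is only a sanity check on iris, where the species are known; on your own data you would look at whatever variables matter to you.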
Final word: there is a very nice graphical answer on SO about choosing the optimal number of clusters: cluster-analysis-in-r-determine-the-optimal-number-of-clusters. Also, some packages, like FactoMineR, ..., allow you to draw these plots with ggplot2.
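As a quick illustration of choosing the number of clusters, here is a sketch of one common heuristic, often called the elbow method: run kmeans() for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. This is only one of several approaches, and the range of k, the seed, and the names ks and wss are my own choices.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible
scores <- prcomp(iris[, 1:4])$x[, 1:2]

# Total within-cluster sum of squares for k = 1, ..., 8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(scores, centers = k, iter.max = 100, nstart = 10)$tot.withinss
})

# Look for the "elbow" where adding clusters stops paying off
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```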