r pca unsupervised-learning biplot

I want to create a subset of my dataframe by how subjects cluster in the biplot


This is one of the biplots that I am working on. The circles mark the clusters that I want to extract into subset data frames.

[Image: biplot with the clusters of interest circled]

If I'm interested in the top cluster, how do I select the data that lie within the rectangle -.1 < PC1 < .1 & .8 < PC2 < 1.6?

I can't share my data, but we can practice using the iris set.


# iris ships with base R (datasets package), so no extra package is needed
biplot(prcomp(iris[,1:4]))

Suppose I'm interested in the data in the rectangle -.125 < PC1 < .75 & -.15 < PC2 < 1.0.

How do I identify that data and create a subset out of it?


Solution

  • You can access the projected points (the PCA scores) through the x element of the prcomp result, pc_res$x:

    pc_res <- prcomp(iris[,1:4])
    str(pc_res) # shows that the projected data (scores) are stored in $x
    #> List of 5
    #>  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
    #>  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
    #>   ..- attr(*, "dimnames")=List of 2
    #>   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
    #>   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
    #>  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
    #>   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
    #>  $ scale   : logi FALSE
    #>  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
    #>   ..- attr(*, "dimnames")=List of 2
    #>   .. ..$ : NULL
    #>   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
    #>  - attr(*, "class")= chr "prcomp"
    dframe <- as.data.frame(pc_res$x)
    # inside subset(), columns can be referred to by name
    sub_res <- subset(dframe, PC1 > -.125 & PC1 < .75 &
                              PC2 > -.15  & PC2 < 1.0)
    head(sub_res)
    #>             PC1       PC2         PC3          PC4
    #> 54  0.183317720 0.8279590  0.17959139  0.093566840
    #> 56  0.641669084 0.4182469 -0.04107609 -0.243116767
    #> 60 -0.008745404 0.7230819 -0.28114143 -0.005618918
    #> 62  0.511698557 0.1039812 -0.13054775  0.050719232
    #> 63  0.264976508 0.5500365  0.69414683  0.057185519
    #> 67  0.660283762 0.3529697 -0.32802753 -0.187878621
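
Two caveats about the subset above may be worth a sketch. First, it contains PC scores, not the original columns. Second, as far as I know `biplot()` for a `prcomp` object does not draw the raw scores: with its defaults (`scale = 1`, `pc.biplot = FALSE`) it plots each score column divided by `sdev * sqrt(n)`, so bounds read off the biplot axes are on a different scale than `pc_res$x`. The snippet below rebuilds those plotted coordinates and uses them to index the original data. The names `biplot_xy`, `in_rect` and `iris_sub` and the rectangle bounds are illustrative assumptions, not values from the question.

```r
pc_res <- prcomp(iris[, 1:4])
scores <- pc_res$x[, 1:2]

# biplot() with default scale = 1 plots scores / (sdev * sqrt(n));
# rebuild those coordinates so they match the bottom/left axes of the plot
n <- nrow(scores)
biplot_xy <- sweep(scores, 2, pc_res$sdev[1:2] * sqrt(n), "/")

# rectangle as read from the biplot axes (made-up bounds for illustration)
in_rect <- biplot_xy[, "PC1"] > -0.15 & biplot_xy[, "PC1"] < -0.05 &
           biplot_xy[, "PC2"] > -0.30 & biplot_xy[, "PC2"] <  0.30

# the logical vector is in the same row order as iris,
# so it selects the original observations directly
iris_sub <- iris[in_rect, ]
table(iris_sub$Species)
```

If the bounds are instead meant on the raw score scale, as in the `subset()` call above, the same indexing idea applies: build the logical condition on `pc_res$x` and use it on `iris`.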
    

    EDIT: To define the clusters themselves, I would use a clustering algorithm (k-means here) rather than hand-picked rectangles:

    # cluster on the projection onto (PC1, PC2)
    # (k-means starts from random centers, so the cluster numbering can differ between runs)
    dframe <- as.data.frame(prcomp(iris[,1:4])$x)
    classif <- kmeans(x = dframe[,1:2], centers = 3, iter.max = 100, nstart = 10)
    classif
    #> K-means clustering with 3 clusters of sizes 61, 39, 50
    #> 
    #> Cluster means:
    #>         PC1        PC2
    #> 1  0.665676  0.3316042
    #> 2  2.346527 -0.2739386
    #> 3 -2.642415 -0.1908850
    #> 
    #> Clustering vector:
    #>   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
    #>  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    #>  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
    #> [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
    #> [141] 2 2 1 2 2 2 1 2 2 1
    #> 
    #> Within cluster sum of squares by cluster:
    #> [1] 31.87959 18.87111 13.06924
    #>  (between_SS / total_SS =  90.4 %)
    #> 
    #> Available components:
    #> 
    #> [1] "cluster"      "centers"      "totss"        "withinss"    
    #> [5] "tot.withinss" "betweenss"    "size"         "iter"        
    #> [9] "ifault"
    
    # inspect the result, then check your groups visually
    str(classif)
    #> List of 9
    #>  $ cluster     : int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
    #>  $ centers     : num [1:3, 1:2] 0.666 2.347 -2.642 0.332 -0.274 ...
    #>   ..- attr(*, "dimnames")=List of 2
    #>   .. ..$ : chr [1:3] "1" "2" "3"
    #>   .. ..$ : chr [1:2] "PC1" "PC2"
    #>  $ totss       : num 666
    #>  $ withinss    : num [1:3] 31.9 18.9 13.1
    #>  $ tot.withinss: num 63.8
    #>  $ betweenss   : num 602
    #>  $ size        : int [1:3] 61 39 50
    #>  $ iter        : int 2
    #>  $ ifault      : int 0
    #>  - attr(*, "class")= chr "kmeans"
    classif$centers
    #>         PC1        PC2
    #> 1  0.665676  0.3316042
    #> 2  2.346527 -0.2739386
    #> 3 -2.642415 -0.1908850
    dframe$group <- classif$cluster
    plot(x = dframe$PC1, y = dframe$PC2, col = dframe$group)
    # keep the group whose center lies in your region of interest (group 1 here)
    

    
    result <- dframe[dframe$group == 1,] # or subset(x = dframe, subset = dframe$group == 1)
    head(result)
    #>           PC1         PC2         PC3           PC4 group
    #> 52  0.9324885 -0.31833364  0.01801419  0.0005665121     1
    #> 54  0.1833177  0.82795901  0.17959139  0.0935668402     1
    #> 55  1.0881033 -0.07459068  0.30775790  0.1120205742     1
    #> 56  0.6416691  0.41824687 -0.04107609 -0.2431167665     1
    #> 57  1.0950607 -0.28346827 -0.16981024 -0.0835565724     1
    #> 58 -0.7491227  1.00489096 -0.01230292 -0.0179077226     1
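
Because the cluster numbers are arbitrary, hard-coding `group == 1` can select a different cluster on another run. A minimal sketch of one way around this, assuming you can name a point of interest in the (PC1, PC2) plane: pick the cluster whose center is closest to that point, then index the original data with it. The `target` coordinates (roughly the middle of the rectangle used earlier), the seed, and the names `wanted` and `iris_cluster` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the cluster numbering repeatable
scores  <- as.data.frame(prcomp(iris[, 1:4])$x)
classif <- kmeans(scores[, 1:2], centers = 3, iter.max = 100, nstart = 10)

# hypothetical point of interest in (PC1, PC2)
target <- c(PC1 = 0.3, PC2 = 0.4)

# squared distance from each cluster center to the target;
# the row position in classif$centers is the cluster number
d2     <- rowSums(sweep(classif$centers, 2, target, "-")^2)
wanted <- unname(which.min(d2))

# cluster labels are in the same row order as iris,
# so they can subset the original data frame directly
iris_cluster <- iris[classif$cluster == wanted, ]
table(iris_cluster$Species)
```

This returns the original measurements and species labels rather than the PC scores, which is usually what a "subset of my dataframe" means.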
    

    Final word: there is a very nice graphical answer on SO about choosing the number of clusters: cluster-analysis-in-r-determine-the-optimal-number-of-clusters. Also, some packages (FactoMineR, ...) let you draw these plots with ggplot2.
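
As a rough illustration of the "number of clusters" question, here is a sketch of the common elbow heuristic, which is only one of several approaches: run k-means for a range of k and plot the total within-cluster sum of squares. The range 1 to 8, the seed and the variable names are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed so the curve is repeatable
scores <- prcomp(iris[, 1:4])$x[, 1:2]

# total within-cluster sum of squares for each candidate k
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(scores, centers = k, nstart = 10)$tot.withinss)

# look for the "elbow": the k after which extra clusters barely lower the curve
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is judged by eye, so treat it as a starting point rather than a definitive answer.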