rpca

Adding lines to connect separate clusters in a chart


I saw this neat principal component analysis graph online, where they had lines connecting each cluster to a center point.


I used an example data set to show that I have gotten as far as adding the ellipses. After looking online, I think this PCA package currently doesn't have the ability to add these lines; in some places this kind of plot is called a "star". Is there a workaround to add them to a PCA chart?

I have added some sample code below that gets up to the part without the connecting lines. Suggestions would be greatly appreciated. My last thought is to use ggforce or something along those lines.

library(factoextra)
data(iris)

res.pca <- prcomp(iris[,-5], scale=TRUE)

fviz_pca_ind(res.pca, label="none", alpha.ind=1, pointshape=19,habillage=iris$Species, addEllipses = TRUE, ellipse.level=0.95)

Some comments have suggested the sites below. They are close, but my case is a bit different: I am using a data frame in which one column holds the categories I want to use for the clusters.

link 1

link 2

Any possible suggestions would be much appreciated please.


Solution

  • A quick and dirty hack is to create an edges df out of the ggplot data inside the output from fviz_pca_ind(), and then plot it with geom_segment().

    Note that this might be visually sub-optimal, because you often want the edges drawn before the nodes so that they highlight (i.e. do not hide) the positions of the nodes. But barring a rewrite of df_raw_pca_viz and the fviz plotting functions, this is a quick way to get what you asked for.

    Try:

    library(factoextra)
    library(purrr)
    library(dplyr)
    data(iris)
    
    res.pca <- prcomp(iris[,-5], scale=TRUE)
    
    g1 <- fviz_pca_ind(res.pca, label="none", alpha.ind=1, pointshape=19,habillage=iris$Species, addEllipses = TRUE, ellipse.level=0.95)
    
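    # Compute each group's centroid from the plot data, then join it back
    # so that every point carries the coordinates of its group's center
    df_edges <-
      pluck(g1, "data") |>
      as_tibble() |>
      group_by(Groups) |>
      summarise(xend = mean(x), yend = mean(y)) |>
      left_join(y = pluck(g1, "data"),
                by = "Groups",
                multiple = "all")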
    
    g1 +
      geom_segment(data = df_edges, aes(xend = xend, yend = yend, x = x, y = y, colour = Groups), alpha = 0.25)
    

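If the drawing order matters, one alternative is to skip fviz_pca_ind() and build the plot directly with ggplot2, so the segments can be added before the points. The following is only a minimal sketch under some assumptions: the names scores, centroids, star, PC1_mean and PC2_mean are made up for illustration, and stat_ellipse() with its default settings is not guaranteed to match the ellipses that factoextra draws.

```r
library(ggplot2)

data(iris)
res.pca <- prcomp(iris[, -5], scale. = TRUE)

# Scores on the first two principal components, plus the grouping column
scores <- data.frame(
  PC1 = res.pca$x[, 1],
  PC2 = res.pca$x[, 2],
  Species = iris$Species
)

# Centroid of each group: the mean score per species
centroids <- aggregate(cbind(PC1, PC2) ~ Species, data = scores, FUN = mean)
names(centroids) <- c("Species", "PC1_mean", "PC2_mean")

# One row per observation, carrying the centroid of its own group
star <- merge(scores, centroids, by = "Species")

# Layers are drawn in the order they are added:
# segments first, then ellipses, then the points on top
p <- ggplot(star, aes(x = PC1, y = PC2, colour = Species)) +
  geom_segment(aes(xend = PC1_mean, yend = PC2_mean), alpha = 0.25) +
  stat_ellipse(level = 0.95) +
  geom_point(shape = 19) +
  theme_minimal()

p
```

Because the segment layer is added first, the points stay visible on top of the lines. The trade-off is that the axis labels, theme and ellipse type from factoextra are lost and would have to be recreated by hand.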