r machine-learning ggplot2 hclust ggdendro

How to add categorical variables to a percentage stacked bar chart?


First time posting here, so let me know if I left out any details that are normally included. I am using ggplot2 and ggdendro to make a percentage stacked bar chart with a hierarchical clustering tree, where each leaf is associated with one of my bars.


As you can see, I have more or less figured this out (note that this is just a subset of my data). I now want to associate a categorical variable with each of my bars, with each level of the variable represented by a color. In my case the variable is HIV+ or HIV-, and each bar represents the % of cells in a given category. Additionally, I want to add the sample name to each dendrogram node, but that issue is less pressing. Below is the code I am using.

library(ggplot2)
library(ggdendro)

# Load in phenograph data
TotalPercentage <- read.csv("~/TotalPercentage.csv", header=TRUE)

# Generate the tree from the cluster columns only (column 1 is the sample ID)
tree <- hclust(dist(TotalPercentage[, -1]))
tree <- dendro_data(tree)

data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))



# Plot the stacked bars below. In "data = tidyr::pivot_longer(data, c(2..." include
## all columns (clusters) but exclude column 1, as this value is our sample ID

scale <- .5
p <- ggplot() +
  geom_col(
    data = tidyr::pivot_longer(data, c(2, 3 , 4, 5, 6, 7, 8)),
    aes(x = x,
        y = value, fill = factor(name)),
  ) +
  # x is the sample position and y is the cluster percentage
  labs(title = "Unsupervised Clustering of Phenograph Output",
       x = "Participant Sample", y = "Cluster Representation (%)"
  ) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  )

p

Here is a sample dataset with fewer rows for simplicity:

data.frame(
  `Participant ID` = c("123", "456", "789"),
  `1` = c(.1933, .1721, 34.26),
  `2` = c(20.95, 4.97, 2.212),
  `3` = c(11.31, 35.34, .027),
  `4` = c(35.55, 15.03, 0),
  `5` = c(.26, .87, 7.58),
  `6` = c(12.85, 33.44, .033),
  `7` = c(2.04, 3.77, 4.32)
)

Participants 1 and 3 are HIV-positive, and participant 2 is HIV-negative.

Finally, here is an example of what I am ultimately trying to produce:

(https://i.sstatic.net/uAWxR.png)

I've looked all over to see how to do this, but I'm new to R and don't know what to do next. Thanks in advance for any help.


Solution

  • Something like this, with randomly generated data:

    library(ggplot2)
    library(ggdendro)

    # randomly generated phenograph data
    # (the three IDs are recycled across the 72 rows)
    set.seed(1)
    TotalPercentage <- data.frame(
      `Participant ID` = c("123", "456", "789"),
      `1` = 125*runif(72),
      `2` = 75*runif(72),
      `3` = 175*runif(72),
      `4` = 10*runif(72),
      `5` = 100*runif(72),
      `6` = 150*runif(72),
      `7` = 200*runif(72)
    )
    

    Now cluster, normalize and plot:

    tree <- hclust(dist(TotalPercentage[, -1])) # cluster on the numeric columns only; column 1 is the ID
    tree <- dendro_data(tree)
    data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))
    data[,2:8] <- data[,2:8] / rowSums(data[,2:8]) # row-normalize
    scale <- 3e-4
    ggplot() +
      geom_col(
        data = tidyr::pivot_longer(data, c(2, 3 , 4, 5, 6, 7, 8)),
        aes(x = x,
            y = value, fill = factor(name)),
      ) +
      # labs() names the aesthetics, so x is still the sample axis after coord_flip()
      labs(title = "Unsupervised Clustering of Phenograph Output",
           x = "Participant Sample", y = "Cluster Representation (%)"
      ) +
      geom_segment(
        data = tree$segments,
        aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
      ) +
      coord_flip()
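
The row-normalization step in the code above divides each row by its own sum, so every bar ends up with total height 1. Here is a minimal base R check of that step, using the three sample rows from the question. The column names `id` and `c1` to `c7` are placeholders chosen for this sketch, not names from the original data.

```r
# Sample values from the question; column names are placeholders.
pct <- data.frame(
  id = c("123", "456", "789"),
  c1 = c(.1933, .1721, 34.26),
  c2 = c(20.95, 4.97, 2.212),
  c3 = c(11.31, 35.34, .027),
  c4 = c(35.55, 15.03, 0),
  c5 = c(.26, .87, 7.58),
  c6 = c(12.85, 33.44, .033),
  c7 = c(2.04, 3.77, 4.32)
)

cluster_cols <- 2:8

# Dividing a data frame by a vector of length nrow() recycles the vector
# down each column, so every row is divided by its own sum.
pct[, cluster_cols] <- pct[, cluster_cols] / rowSums(pct[, cluster_cols])

# After normalization each row should sum to 1 (up to floating point error).
row_totals <- rowSums(pct[, cluster_cols])
```

Multiplying the normalized columns by 100 would give percentages instead of proportions, if that is what the axis label should show.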
    

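
The question also asks for a per-bar categorical variable (HIV status) and for sample names. Here is one possible sketch, not necessarily the approach the answer above intended. It assumes the status is stored in an extra column, here called `status`, and that the ID and cluster columns are named `id` and `c1` to `c7`; all of these names and the colors are made up for illustration. The status is drawn as a square marker mapped to `colour`, which keeps it independent of the `fill` scale used for the clusters, and the sample IDs are placed on the axis with `scale_x_continuous()`.

```r
library(ggplot2)
library(ggdendro)

# Sample values from the question; `status` is an assumed extra column
# (participants 1 and 3 are HIV+, participant 2 is HIV-).
df <- data.frame(
  id = c("123", "456", "789"),
  c1 = c(.1933, .1721, 34.26),
  c2 = c(20.95, 4.97, 2.212),
  c3 = c(11.31, 35.34, .027),
  c4 = c(35.55, 15.03, 0),
  c5 = c(.26, .87, 7.58),
  c6 = c(12.85, 33.44, .033),
  c7 = c(2.04, 3.77, 4.32),
  status = c("HIV+", "HIV-", "HIV+")
)
cluster_cols <- 2:8

# Cluster on the numeric cluster columns only.
tree <- dendro_data(hclust(dist(df[, cluster_cols])))

# Leaf labels are the row names; their order gives each row's bar position.
df$x <- match(rownames(df), tree$labels$label)

# Row-normalize so each bar sums to 1.
df[, cluster_cols] <- df[, cluster_cols] / rowSums(df[, cluster_cols])
long <- tidyr::pivot_longer(df, cols = 2:8)

# Scale the dendrogram so its root sits at y = -0.5.
scale <- 0.5 / max(tree$segments$y)

p <- ggplot() +
  geom_col(data = long, aes(x = x, y = value, fill = name)) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  ) +
  # Status marker just past the end of each bar, mapped to colour, not fill.
  geom_point(data = df, aes(x = x, y = 1.05, colour = status),
             shape = 15, size = 5) +
  scale_colour_manual(values = c("HIV+" = "firebrick", "HIV-" = "grey40"),
                      name = "HIV status") +
  # Show the sample IDs at the bar positions.
  scale_x_continuous(breaks = df$x, labels = df$id) +
  labs(x = "Participant Sample", y = "Cluster Representation (proportion)",
       fill = "Cluster") +
  coord_flip()

p
```

If the status should instead be a filled tile sharing the bar geometry, a second fill scale would be needed, which ggplot2 does not support on its own; the `colour` mapping above avoids that.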