First time posting here, so let me know if I left out any details that are normally included. I am using ggplot2 and ggdendro to make a stacked percentage bar chart with a hierarchical clustering tree, where each leaf node is associated with one of my bars.
As you can see, I have more or less figured this out (note that this is just a subset of my data). I now want to associate a categorical variable with each of my bars, with each level of the variable represented by a color (in my case the variable is HIV+ or HIV-, and each bar represents the % of cells in a given category). Additionally, I want to figure out how to add the sample name to each dendrogram node, but this issue is less pressing. Below is the code block I am using.
library(ggplot2)
library(ggdendro)
# Load in phenograph data
TotalPercentage <- read.csv("~/TotalPercentage.csv", header=TRUE)
#generate tree
tree <- hclust(dist(TotalPercentage))
tree <- dendro_data(tree)
data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))
# plot the stacked bars below; in "data = tidyr::pivot_longer(data, c(2..." include
## all columns (clusters) but exclude column 1, as this value is our sample ID
scale <- .5
p <- ggplot() +
  geom_col(
    data = tidyr::pivot_longer(data, c(2, 3, 4, 5, 6, 7, 8)),
    aes(x = x, y = value, fill = factor(name))
  ) +
  labs(
    title = "Unsupervised Clustering of Phenograph Output",
    x = "Participant Sample", y = "Cluster Representation (%)"
  ) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  )
p
Here is a sample dataset with fewer rows for simplicity:
data.frame(
  `Participant ID` = c("123", "456", "789"),
  `1` = c(.1933, .1721, 34.26),
  `2` = c(20.95, 4.97, 2.212),
  `3` = c(11.31, 35.34, .027),
  `4` = c(35.55, 15.03, 0),
  `5` = c(.26, .87, 7.58),
  `6` = c(12.85, 33.44, .033),
  `7` = c(2.04, 3.77, 4.32)
)
Participants 1 and 3 (IDs 123 and 789) are HIV+, while participant 2 (ID 456) is HIV-.
And finally, here is an example of what I am ultimately trying to produce:
(https://i.sstatic.net/uAWxR.png)
I've looked all over to see how to do this but I'm new to R so I'm kind of free floating and don't know what to do next. Thanks in advance for any help.
Something like this, with randomly generated data:
# randomly generated phenograph data
set.seed(1)
TotalPercentage <- data.frame(
  `Participant ID` = c("123", "456", "789"),
  `1` = 125*runif(72),
  `2` = 75*runif(72),
  `3` = 175*runif(72),
  `4` = 10*runif(72),
  `5` = 100*runif(72),
  `6` = 150*runif(72),
  `7` = 200*runif(72)
)
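Now cluster, normalize, and plot:
library(ggplot2)
library(ggdendro)
# hierarchical clustering, converted to plottable segments and labels
tree <- hclust(dist(TotalPercentage))
tree <- dendro_data(tree)
# x = position of each sample among the dendrogram leaves
data <- cbind(TotalPercentage, x = match(rownames(TotalPercentage), tree$labels$label))
data[,2:8] <- data[,2:8] / rowSums(data[,2:8]) # row-normalize so each bar sums to 1
scale <- 3e-4 # shrinks dendrogram heights to the bar scale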
ggplot() +
  geom_col(
    data = tidyr::pivot_longer(data, c(2, 3, 4, 5, 6, 7, 8)),
    aes(x = x, y = value, fill = factor(name))
  ) +
  labs(
    title = "Unsupervised Clustering of Phenograph Output",
    x = "Participant Sample", y = "Cluster Representation (%)"
  ) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  ) +
  coord_flip()
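The plot above does not yet show HIV status or sample names. One possible way to add both is sketched below. Everything in the data is made up for illustration: the `samples` data frame, the `P01`-style names, the alternating `HIV` column, and the colors are all assumptions, not part of the original data. The sketch also clusters on the numeric columns only, which differs from the code above. The idea is to keep `fill` for the clusters and use the separate `colour` scale for HIV status (one square per bar), and to put the sample names on the axis at the leaf positions.

```r
library(ggplot2)
library(ggdendro)

set.seed(1)
n <- 12
samples <- data.frame(
  Participant.ID = sprintf("P%02d", 1:n),   # made-up sample names
  matrix(100 * runif(n * 7), nrow = n)      # columns X1..X7: made-up cluster values
)
# made-up HIV status per sample (assumption, for illustration only)
samples$HIV <- rep(c("HIV+", "HIV-"), length.out = n)

# cluster on the numeric columns only (assumption: the ID should not affect distances)
tree <- dendro_data(hclust(dist(samples[, 2:8])))
# position of each sample among the dendrogram leaves
samples$x <- match(rownames(samples), tree$labels$label)
# row-normalize so each bar sums to 1
samples[, 2:8] <- samples[, 2:8] / rowSums(samples[, 2:8])

long <- tidyr::pivot_longer(samples, 2:8)
ord <- order(samples$x)
scale <- 3e-3  # shrinks dendrogram heights; tune by eye

p <- ggplot() +
  geom_col(data = long, aes(x = x, y = value, fill = factor(name))) +
  geom_segment(
    data = tree$segments,
    aes(x = x, y = -y * scale, xend = xend, yend = -yend * scale)
  ) +
  # one colored square per bar, just past the end of the bar (bars span 0 to 1)
  geom_point(
    data = samples,
    aes(x = x, y = 1.05, colour = HIV),
    shape = 15, size = 4
  ) +
  scale_colour_manual(values = c("HIV+" = "firebrick", "HIV-" = "steelblue")) +
  # sample names as axis labels at the leaf positions
  scale_x_continuous(
    breaks = samples$x[ord],
    labels = samples$Participant.ID[ord]
  ) +
  labs(
    x = "Participant Sample", y = "Cluster Representation (fraction)",
    fill = "Cluster", colour = "HIV status"
  ) +
  coord_flip()
p
```

With real data, the `HIV` column would come from your own metadata, matched to the rows by participant ID. If you would rather print the names next to the leaves than on the axis, `geom_text()` with the same `x` positions should work as well, though the `y` offset would need tuning by eye.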