I want to create an organized stacked barplot where bars with similar proportions appear together. I have a data frame of 10,000 individuals, and each individual has ancestry proportions from three populations. Here is my data.
library(MCMCpack)
library(tidyr)  # provides gather(), used below to reshape to long format
library(ggplot2)
n = 10000
alpha = c(0.1, 0.1, 0.1)
q <- as.data.frame(rdirichlet(n,alpha))
head(q)
individuals <- c(1:nrow(q))
q <- cbind(q, individuals)
head(q)
V1 V2 V3 individuals
1 0.0032720232 3.381345e-08 0.996727943 1
2 0.3354060035 4.433923e-01 0.221201688 2
3 0.0004121665 9.661220e-01 0.033465842 3
4 0.9966997182 3.234048e-03 0.000066234 4
5 0.7789280208 2.090134e-01 0.012058562 5
6 0.0005048727 9.408364e-02 0.905411485 6
# long format for ggplot2 plotting
qm <- gather(q, key, value, -individuals)
colnames(qm) <- c("individuals", "ancestry", "proportions")
head(qm)
individuals ancestry proportions
1 1 V1 0.0032720232
2 2 V1 0.3354060035
3 3 V1 0.0004121665
4 4 V1 0.9966997182
5 5 V1 0.7789280208
6 6 V1 0.0005048727
Without any kind of ordering of data, I plotted the stacked barplot as:
ggplot(qm) + geom_bar(aes(x = individuals, y = proportions, fill= ancestry), stat="identity")
I have three questions: (1) I don't know how to make individuals with similar proportions cluster together. I have tried many solutions on Stack Exchange already but can't get them to work on my dataset!
(2) For some reason, when I implement the code to order individuals by decreasing/increasing proportions in one ancestry, it sometimes works on smaller toy datasets I create, but when I try to plot 10,000 individuals, it doesn't work anymore! Is this a problem in ggplot2 or am I doing something wrong? I would appreciate it if any answer to this thread also worked for a stacked barplot of n = 10,000 individuals.
(3) Not sure if I'm imagining this, but in my stacked barplot, it seems like R is clustering the stacked bar plots in some order unknown to me -- because I can see regular gaps between the stacked plots. In reality, there should be no gaps and I'm not sure why this is happening.
I would appreciate any help since I have already worked on this code for an embarrassingly long amount of time!!
Since the variance of the proportions within each ancestry is very high, the bars look as if they are clustered with other ancestries. The plot is drawn correctly. However, the differences are hard to distinguish because the number of individuals is so high.
If you think the proportions in your data set would not lose their meaning, and could be interpreted in the same way, after being transformed into exponential or log values, you can try that.
The stacked bar with exponential of the proportions:
ggplot(qm) + geom_bar(aes(x = individuals, y = exp(proportions), fill= ancestry),
stat="identity")
If you don't want to have gaps between the bars, set width to 1.
ggplot(qm) + geom_bar(aes(x = individuals, y = exp(proportions), fill= ancestry),
stat="identity",
width=1)
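For question (1), one possible approach (a sketch, not the only way to do it) is to fix the order of the x axis yourself by turning individuals into a factor whose levels follow the ordering you want. The example below groups individuals by their largest ancestry and then sorts by that proportion within each group. It uses a small synthetic data set built with base R as a stand-in for your q; the names anc, m, dominant, top_share and ord are made up for illustration, and the ordering rule itself is an assumption about what "similar proportions" should mean for you.

```r
library(ggplot2)

# Synthetic stand-in for the data in the question (assumption: three ancestry
# columns V1..V3 that sum to 1 for every individual).
set.seed(1)
n <- 200
raw <- matrix(rgamma(n * 3, shape = 0.1), ncol = 3)
q <- as.data.frame(raw / rowSums(raw))
q$individuals <- seq_len(n)

anc <- c("V1", "V2", "V3")
m <- as.matrix(q[anc])

# For each individual: which ancestry is largest, and how large is it?
dominant  <- max.col(m, ties.method = "first")
top_share <- m[cbind(seq_len(nrow(m)), dominant)]

# Group by dominant ancestry, then sort by decreasing share within each group.
ord <- order(dominant, -top_share)

# Long format, built by hand (as.vector(m) runs down V1, then V2, then V3).
qm <- data.frame(
  individuals = rep(q$individuals, times = length(anc)),
  ancestry    = rep(anc, each = nrow(q)),
  proportions = as.vector(m)
)

# The factor levels, not the row order, decide where each bar is drawn.
qm$individuals <- factor(qm$individuals, levels = q$individuals[ord])

p <- ggplot(qm) +
  geom_col(aes(x = individuals, y = proportions, fill = ancestry), width = 1) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
print(p)
```

Sorting the rows of the data frame has no effect on a ggplot2 axis, so the ordering has to go into the factor levels. The x-axis labels are blanked because thousands of them would not be readable anyway.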