I am analysing the distribution of my data using mclust (a follow-up to "Clustering with Mclust results in an empty cluster").
My data is available for download here: https://www.file-upload.net/download-14320392/example.csv.html
First, I evaluate the clusters present in my data:
library(reshape2)
library(mclust)
library(ggplot2)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust E (univariate, equal variance) model with 4 components:

 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69

Clustering table:
   1    2    3    4
   0 2271  896   91

Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511

Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667

Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Having identified them, I would like to overlay the total distribution with the distributions of the individual components. To do this, I tried to extract the assignment of each value to its cluster using:
df <- as.data.frame(data)
df$classification <- as.factor(df$value[fit$classification])
ggplot(df, aes(value, fill= classification)) +
geom_density(aes(col=classification, fill = NULL), size = 1)
As a result, I get the following:
It seems to have worked; however, I wonder:
a) where the descriptions (1602, 1639 and 1823) of the individual classifications come from
b) how I can scale the individual densities as a fraction of the total (for example 1823 contributes only 91 values out of 3258 observations; see above)
c) if it makes sense to alternatively use predicted normal distributions based on the mean + SD obtained?
Any help or suggestions are highly appreciated!
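I think you could get what you want in the following way. Regarding (a): df$value[fit$classification] uses the cluster numbers (2, 3 and 4) as row indices into df$value, so the legend shows the data values stored in those rows rather than the cluster labels. Using fit$classification directly as the factor avoids this, and mapping y to the count scales each density by its cluster size: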
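library(dplyr)    # provides mutate() and re-exports %>%
library(ggplot2)
# Attach each observation's cluster assignment as a factor
data_melt <- data_melt %>% mutate(class = as.factor(fit$classification))
# y = count scales each curve by the number of observations in its cluster
# (after_stat() needs ggplot2 >= 3.3.0; older versions use ..count..)
ggplot(data_melt, aes(x = value, colour = class, fill = class)) +
  geom_density(aes(y = after_stat(count)), alpha = .25)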
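For (c), one option is to draw the fitted normal components themselves, each weighted by its mixing probability, alongside their sum. The following is only a minimal sketch: the parameter values are copied from the summary(fit) output in the question so that it runs without the data file, and the grid range and the names pro, mu, sdev, dens, components and mixture are my own choices for illustration. With the fitted object available, the same values should be obtainable from fit$parameters$pro, fit$parameters$mean and fit$parameters$variance$sigmasq.

```r
library(ggplot2)

# Parameters copied from the summary(fit) output in the question
pro  <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)  # mixing probabilities
mu   <- c(1381.391, 1381.715, 1574.335, 1851.667)      # component means
sdev <- sqrt(7466.189)                                 # common SD (model "E")

# Evaluation grid; the +/- 4 SD range is an assumption chosen to cover all components
x <- seq(min(mu) - 4 * sdev, max(mu) + 4 * sdev, length.out = 500)

# Column k holds the weighted density of component k: pro[k] * N(x | mu[k], sdev)
dens <- sapply(seq_along(mu), function(k) pro[k] * dnorm(x, mean = mu[k], sd = sdev))

# Long format for ggplot: one row per (grid point, component)
components <- data.frame(
  x         = rep(x, times = ncol(dens)),
  density   = as.vector(dens),
  component = factor(rep(seq_along(mu), each = length(x)))
)

# The mixture density is the sum of the weighted components at each grid point
mixture <- data.frame(x = x, density = rowSums(dens))

p <- ggplot() +
  geom_line(data = mixture, aes(x, density)) +
  geom_line(data = components, aes(x, density, colour = component),
            linetype = "dashed")
print(p)
```

Because each component is multiplied by its mixing probability, the dashed curves add up to the solid mixture curve, and a small component such as the fourth (about 3% of the mass) appears correspondingly small. Components 1 and 2 have nearly identical means and the same variance, so their curves differ only in height. To compare against the data, a histogram or density of the observed values on the density scale could be added as a further layer.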