r ggplot2 cluster-analysis gmm mclust

Visualizing clusters extracted from Mclust using ggplot2


I am analysing the distribution of my data using mclust (a follow-up to "Clustering with Mclust results in an empty cluster").
My data is available for download here: https://www.file-upload.net/download-14320392/example.csv.html

First, I evaluate the clusters present in my data:

library(reshape2)
library(mclust)
library(ggplot2)

data <- read.csv(file.choose(), header=TRUE,  check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)

fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust E (univariate, equal variance) model with 4 components: 

log-likelihood    n df       BIC       ICL
-20504.71 3258  8 -41074.13 -44326.69

Clustering table:
1    2    3    4 
0 2271  896   91 

Mixing probabilities:
1         2         3         4 
0.2807685 0.4342499 0.2544305 0.0305511 

Means:
1        2        3        4 
1381.391 1381.715 1574.335 1851.667 

Variances:
1        2        3        4 
7466.189 7466.189 7466.189 7466.189 

Now that the clusters are identified, I would like to overlay the total distribution with the distributions of the individual components. To do this, I tried to extract the cluster assignment of each value using:

df <- as.data.frame(data)
df$classification <- as.factor(df$value[fit$classification])

ggplot(df, aes(value, fill= classification)) + 
  geom_density(aes(col=classification, fill = NULL), size = 1)

As a result, I get the following: [image: density plot, one curve per classification level]

It seems to have worked; however, I wonder:
a) where the labels (1602, 1639 and 1823) of the individual classifications come from,
b) how I can scale the individual densities as a fraction of the total (for example, 1823 accounts for only 91 of the 3258 observations; see above),
c) whether it would make sense to use predicted normal distributions based on the fitted mean and SD instead.

Any help or suggestions are highly appreciated!


Solution

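  • I think you could get what you want by using the classification itself as the grouping factor and plotting counts instead of densities, so that each curve is scaled by the number of observations in its cluster:

    library(dplyr)  # mutate() comes from dplyr, which also re-exports %>%
    data_melt <- data_melt %>% mutate(class = as.factor(fit$classification))
    ggplot(data_melt, aes(x=value, colour=class, fill=class)) + 
        # after_stat(count) replaces ..count.., deprecated since ggplot2 3.4.0
        geom_density(aes(y=after_stat(count)), alpha=.25)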
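
As for point a), the labels 1602, 1639 and 1823 most likely come from the indexing in the question: `df$value[fit$classification]` uses the cluster labels (2, 3 and 4 here, since cluster 1 is empty) as positions into `value`, so every observation is mapped to the 2nd, 3rd or 4th data value. A minimal sketch with made-up numbers, where `value` and `classification` are hypothetical stand-ins for `df$value` and `fit$classification`:

```r
# Hypothetical stand-ins: five observations and their cluster labels.
value <- c(1500, 1602, 1639, 1823, 1390)
classification <- c(2, 2, 3, 4, 2)

# Question's approach: the labels are used as positions into `value`,
# so the factor levels are the 2nd, 3rd and 4th data values.
wrong <- as.factor(value[classification])
levels(wrong)   # "1602" "1639" "1823"

# Answer's approach: the labels themselves become the factor levels.
right <- as.factor(classification)
levels(right)   # "2" "3" "4"
```

The grouping happens to be correct in both cases; only the legend labels differ.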
    

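
For point c), one option is to draw each fitted Gaussian scaled by its mixing probability, so that the component curves add up to the overall mixture density. The sketch below uses the numbers printed by `summary(fit)` in the question; in a live session the same values should be available as `fit$parameters$pro`, `fit$parameters$mean` and `fit$parameters$variance$sigmasq`. The grid range and the names `comp`, `mixture` and `curves` are my own choices, not part of mclust:

```r
# Parameters copied from the summary output in the question.
pro   <- c(0.2807685, 0.4342499, 0.2544305, 0.0305511)  # mixing probabilities
mu    <- c(1381.391, 1381.715, 1574.335, 1851.667)      # component means
sigma <- sqrt(7466.189)                                  # shared SD (model "E")

# Grid of x values (range chosen by eye to cover all components).
x <- seq(1000, 2300, length.out = 500)

# One column per component: mixing probability times normal density.
comp <- sapply(seq_along(mu), function(k) pro[k] * dnorm(x, mu[k], sigma))

# The weighted components sum to the mixture density at each x.
mixture <- rowSums(comp)

# Long format, one row per (x, component) pair, convenient for ggplot2.
curves <- data.frame(
  x       = rep(x, times = ncol(comp)),
  class   = factor(rep(seq_along(mu), each = length(x))),
  density = as.vector(comp)
)
```

With `data_melt` from above, something like `ggplot(data_melt, aes(value)) + geom_density() + geom_line(data = curves, aes(x, density, colour = class))` should then overlay the weighted components on the empirical density.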