Tags: r, cluster-analysis, gmm, mclust

Mclust() - NAs in model selection


I recently tried to fit a GMM in R to a multivariate matrix (400 observations of 196 variables) whose rows belong to known categories. The Mclust() function (from the mclust package) gave very poor results: around 30% of individuals were correctly classified, whereas k-means reaches more than 90%.

Here is my code:

library(mclust)

X <- read.csv("X.csv", sep = ",", header = TRUE)
# Keep the labels as a plain vector (assumes they are in the first column of y.csv)
y <- read.csv("y.csv", sep = ",")[[1]]

nclusters <- 5                     # I want 5 clusters
gmm <- Mclust(X, G = nclusters)

cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm

for (k in 1:nclusters) {
  ii <- which(cl_gmm == k)                     # individuals of group k
  counts <- table(y[ii])                       # number of occurrences of each label
  maj_lab <- names(counts)[which.max(counts)]  # majority label
  print(paste("Group ", k, ", majority label = ", maj_lab))
  cl_gmm_lab[ii] <- maj_lab
}

conf_mat_gmm <- table(y, cl_gmm_lab)    # confusion matrix

The problem seems to come from the fact that every model other than "EII" (spherical, equal volume) is NA in gmm$BIC.

So far I have not found any solution to this problem. Are you familiar with this issue?

Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true


Solution

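  • I finally found the answer. With too many explanatory variables (here 196 for only 400 observations), Mclust() cannot fit every covariance model: the less constrained models need far more parameters than the data can support, which is presumably why their BIC comes back as NA. The right thing to do is to reduce the dimensionality first, choosing a number of dimensions small enough for the GMM to be fitted properly while preserving as much information about the data as possible.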
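
As a rough illustration of this approach, here is a minimal sketch that reduces the data with PCA (base R's prcomp()) before calling Mclust(). It runs on synthetic data with the same shape as in the question (400 x 196, 5 groups), not on the original files. The choice of 10 components is an arbitrary assumption for the example; in practice the number of components should be tuned, for instance by looking at the cumulative explained variance.

```r
library(mclust)

set.seed(1)

# Synthetic stand-in for the real data (assumption): 400 observations,
# 196 variables, 5 groups with different means
n <- 400; d <- 196; G <- 5
labels  <- rep(1:G, each = n / G)
centers <- matrix(rnorm(G * d, sd = 2), nrow = G)
X       <- centers[labels, ] + matrix(rnorm(n * d), nrow = n)

# Free covariance parameters of the unconstrained "VVV" model in the
# original space: G * d * (d + 1) / 2 = 96530, versus only 400 observations
n_cov_params <- G * d * (d + 1) / 2

# Reduce dimensions with PCA and keep the first n_comp principal components
# (n_comp = 10 is an illustrative choice, not a recommendation)
pca     <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative explained variance
n_comp  <- 10
scores  <- pca$x[, 1:n_comp, drop = FALSE]

# Fit the GMM on the reduced data; more covariance models should now be estimable
gmm <- Mclust(scores, G = G)
print(gmm$BIC)                            # inspect which models still return NA
print(table(labels, gmm$classification))  # cross-tabulate clusters and true groups
```

If several models still show NA in gmm$BIC, reducing n_comp further is one thing to try; cum_var[n_comp] indicates how much of the variance the retained components keep.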