scikit-learn cluster-analysis dirichlet

DPGMM Clustering All Values into Single Cluster


So I have converted my corpus into a nice word2vec matrix. It is a floating-point matrix with both negative and positive values.

I can't seem to get the infinite Dirichlet process to give me any cohesive answer.

An example output [for 2 steps] looks like:

original word2vec matrix:
[[-0.09597077 -0.1617426  -0.01935256 ...,  0.03843787 -0.11019679
   0.02837373]
 [-0.20119116  0.09759717  0.1382935  ..., -0.08172804 -0.14392921
  -0.08032629]
 [-0.04258473  0.03070175  0.11503845 ..., -0.10350088 -0.18130976
  -0.02993774]
 ..., 
 [-0.08478324 -0.01961064  0.02305113 ..., -0.01231162 -0.10988192
   0.00473828]
 [ 0.13998444  0.05631495  0.00559074 ...,  0.05252389 -0.14202785
  -0.03951728]
 [-0.02888418 -0.0327519  -0.09636743 ...,  0.10880557 -0.08889513
  -0.08584201]]
Running DGPMM for 20 clusters of shape (4480, 100)
Bound after updating        z: -1935576384.727921
Bound after updating    gamma: -1935354454.981427
Bound after updating       mu: -1935354033.389434
Bound after updating  a and b: -inf
Cluster proportions: [  4.48098985e+03   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00]
covariance_type: full
Bound after updating        z: -inf
Bound after updating    gamma: -inf
Bound after updating       mu: -inf
Bound after updating  a and b: -inf
Cluster proportions: [  4.48098985e+03   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00
   1.00053406e+00   1.00053406e+00   1.00053406e+00   1.00053406e+00]

As you can see, the bounds for z, gamma and mu all explode, and the system eventually converges to just one cluster, which is not really accurate. I have tried fiddling with alpha for the DPGMM, but it doesn't really change much.

What I am trying to do is automatically group words that are close in meaning using an autonomous clustering system. K-Means requires 'K', which I do not want to provide.


Solution

  • There may be some hidden numerical issues here. The problem is the high dimensionality of your data set. This leads to vanishingly small likelihoods in Gaussian mixture modeling, which makes the model extremely unlikely. At some point you get a -inf value, and then it fails.

    Overall, the clustering seems to fail badly. If you look at the cluster sizes, you can see both the numerical problems and that the result has degenerated.

    One cluster has size 4480.98985, and the other 19 clusters have size 1.00053406 each. I would expect these to add up to 4480, but they sum to about 4500 (that is, 4480 + 20), so each value appears to carry an offset of roughly 1. Either way, 19 out of 20 clusters hold at most about a single element. So you may have a problem with outliers, too.

    K-means won't work better either.
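
To make the numerical point above concrete, here is a minimal sketch of how likelihoods can vanish in 100 dimensions. It is not what DPGMM does internally. It assumes synthetic data of the same shape as the question's matrix (4480 x 100, values around 0.1) and a single hypothetical unit-variance Gaussian component. The function name and all parameters are illustrative.

```python
import numpy as np

def isotropic_gaussian_logpdf(X, mean, var):
    """Per-row log density of N(mean, var * I) for the rows of X."""
    n_features = X.shape[1]
    sq_dist = np.sum((X - mean) ** 2, axis=1)
    return -0.5 * (n_features * np.log(2.0 * np.pi * var) + sq_dist / var)

# Synthetic stand-in for the word2vec matrix (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(4480, 100))

# Log density of every point under one unit-variance component at the origin.
log_p = isotropic_gaussian_logpdf(X, mean=np.zeros(100), var=1.0)

# Summing log densities stays finite, even though the value is hugely negative.
total_log_lik = log_p.sum()

# Multiplying the raw densities underflows to exactly 0.0 in float64 ...
likelihood_product = np.prod(np.exp(log_p))

# ... and taking the log of that zero yields -inf.
with np.errstate(divide="ignore"):
    log_of_product = np.log(likelihood_product)

print("sum of log densities:", total_log_lik)
print("product of densities:", likelihood_product)
print("log of the product:  ", log_of_product)
```

Working in log space keeps the total finite here, but any step that leaves log space (or takes the log of a quantity that has already underflowed) can produce the kind of -inf seen in the bound output.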