I am trying to implement kernel k-means clustering with the kkmeans() function from the kernlab R package. My problem is that my code returns the expected output when I specify some numbers of clusters with the function's centers argument, but throws an error for other numbers of clusters:
Error in if (sum(abs(dc)) < 1e-15) break : missing value where TRUE/FALSE needed
My guess is that this is a convergence issue since the error seems to arise when I increase the number of clusters, but this would be surprising since I have many more rows than the number of clusters I'm specifying. While I can specify 10 clusters with success with an 8000x3 matrix, I receive an error with 100 clusters. Similarly, I can specify 5 clusters but not 10 with a 50-row subset of that data.
Below is a reproducible minimal example where my code replicates the success and the error.
centers = 10
kernlab::kkmeans(mymat, centers=10)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
centers = 5
kernlab::kkmeans(mymat, centers=5)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Spectral Clustering object of class "specc"
#>
#> Cluster memberships:
#>
#> 1 1 1 1 2 1 1 3 3 5 5 5 3 2 2 2 4 4 3 3 5 2 2 5 5 5 5 5 5 2 4 3 3 3 2 2 5 3 3 5 5 4 4 4 3 1 4 2 5 3
#>
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.756590498067127
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.75871 -16.69486 191.5841
#> [2,] 16.74850 -21.94730 186.8914
#> [3,] 15.99483 -18.95892 190.2622
#> [4,] 15.45729 -18.13571 191.9611
#> [5,] 16.69136 -22.19600 187.0055
#>
#> Cluster size:
#> [1] 7 10 12 7 14
#>
#> Within-cluster sum of squares:
#> [1] 301006.7 443237.8 607889.4 305777.1 685823.5
mymat <- structure(c(15.9390001296997, 15.9079999923706, 16.087999343872,
15.7930002212524, 15.9619998931884, 15.6129999160766, 15.7550001144409,
16.7740001678466, 16.9080009460449, 17.0769996643066, 16.3640003204345,
16.5960006713867, 16.579999923706, 16.4570007324218, 16.2320003509521,
16.1639995574951, 15.6180000305175, 15.5109996795654, 15.5120000839233,
15.628999710083, 16.9950008392333, 17.3530006408691, 17.2229995727539,
16.8910007476806, 17.1800003051757, 17.1709995269775, 16.9860000610351,
16.704999923706, 16.273000717163, 15.8830003738403, 15.6230001449584,
15.333999633789, 15.3839998245239, 15.3870000839233, 17.1119995117187,
17.6200008392333, 16.8349990844726, 16.4969997406005, 16.2479991912841,
16.1259994506835, 15.8059997558593, 15.378999710083, 15.4320001602172,
15.2100000381469, 15.2519998550415, 15.2150001525878, 15.4280004501342,
17.4790000915527, 16.6739997863769, 16.4330005645751, -16.6299991607666,
-16.9529991149902, -17.5610008239746, -17.8290004730224, -18.6200008392333,
-17.1079998016357, -16.25, -21.716999053955, -21.1219997406005,
-21.8209991455078, -20.1840000152587, -20.0450000762939, -20.9599990844726,
-19.5240001678466, -18.6590003967285, -19.4379997253417, -18.6280002593994,
-18.0669994354248, -16.204999923706, -15.5830001831054, -23.9489994049072,
-23.57200050354, -24.3969993591308, -23.2880001068115, -22.6019992828369,
-23.2329998016357, -22.5979995727539, -22.6140003204345, -20.8059997558593,
-19.4300003051757, -19.4729995727539, -17.5690002441406, -16.8110008239746,
-15.2930002212524, -25.2509994506835, -24.7649993896484, -24.8080005645751,
-21.9939994812011, -21.5189990997314, -20.329999923706, -20.25,
-19.1380004882812, -18.6180000305175, -18.5900001525878, -16.1620006561279,
-14.5329999923706, -14.4359998703002, -25.8169994354248, -24.2159996032714,
-22.57200050354, 190.996994018554, 190.996002197265, 190.18699645996,
191.039993286132, 190.205993652343, 191.919006347656, 191.766006469726,
187.14599609375, 186.889007568359, 186.225997924804, 188.60400390625,
187.932006835937, 187.837005615234, 188.453002929687, 189.382995605468,
189.360000610351, 191.25, 191.845001220703, 192.580001831054,
192.414993286132, 185.358001708984, 184.570999145507, 184.595993041992,
186.091995239257, 185.613998413085, 185.25, 186.235000610351,
187.003005981445, 188.744995117187, 190.169998168945, 190.921005249023,
192.628997802734, 192.768005371093, 193.281997680664, 184.602996826171,
183.796005249023, 185.414001464843, 187.811004638671, 188.615005493164,
189.263000488281, 190.167007446289, 191.781997680664, 191.837997436523,
192.582000732421, 193.399002075195, 194.184005737304, 193.509994506835,
183.776000976562, 186.173995971679, 187.774993896484), dim = c(50L,
3L), dimnames = list(NULL, c("x", "y", "z")))
This appears to be an issue with something randomly generated internally by the function during your kkmeans() call. I don't have an answer for "why" this is happening and you'll likely have to check with the authors to determine if it's a bug or intended behavior.
While I reproduced your error with your data and code (running a fresh instance of R every time), the exact same function call also sometimes produces other errors and sometimes doesn't produce an error. However, whether it does so is entirely reproducible when you call set.seed(), suggesting it has something to do with starting values that determine other parameters of the model.
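A minimal sketch of how one might map which seeds fail, assuming mymat from the question is in the workspace and kernlab is installed. The helper name probe_seeds is hypothetical (it is not part of kernlab); it only sets the seed, runs the call, and records any error message.

```r
# Run `fit` once per seed and record whether it succeeded or raised an error.
# `probe_seeds` is a hypothetical helper name, not a kernlab function.
probe_seeds <- function(seeds, fit) {
  rows <- lapply(seeds, function(s) {
    set.seed(s)  # fix the RNG state before each attempt
    msg <- tryCatch({
      fit()
      NA_character_  # NA means the call finished without an error
    }, error = function(e) conditionMessage(e))
    data.frame(seed = s, ok = is.na(msg), message = msg,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Only runs when kernlab and the question's `mymat` are available.
if (requireNamespace("kernlab", quietly = TRUE) && exists("mymat")) {
  res <- probe_seeds(1:20, function() kernlab::kkmeans(mymat, centers = 10))
  print(table(res$ok))                 # how many seeds succeed vs. fail
  print(unique(res$message[!res$ok]))  # the distinct error messages seen
}
```

Any seed with ok equal to TRUE can then be reused to reproduce a successful 10-cluster run.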
Below I show (a) that this can produce an alternative error (actually, I saw a third but didn't save the seed to reproduce it), (b) that even when it does "converge," it is producing pretty different clusters just on the basis of the random seed, and (c) the hyperparameter tuning is heavily influenced by the random number seed. I forgot to save the seed for the run where I was able to get some clustering results with 10 clusters.
I don't have an answer for why this happens: my hunch is that the automatically-generated settings are nonsensical/out of bounds in some cases and this is producing an error. This may be because your data are in some way strange or may be because the algorithm for setting the hyperparameter(s) doesn't make much sense. It could also be a bug, so perhaps worth posting as an issue.
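One way to test that hunch is to take the automatic estimate out of the picture by passing a fixed sigma through kpar. The sketch below assumes mymat from the question and an installed kernlab; fit_fixed_sigma is a hypothetical helper name, and the value 0.3 is an arbitrary choice within the range of sigmas reported in the runs here, not a recommended setting. It removes only the seed dependence of sigma; any other random starting values inside kkmeans() still depend on the seed.

```r
# `fit_fixed_sigma` is a hypothetical helper name, not a kernlab function.
fit_fixed_sigma <- function(x, centers, sigma) {
  # Reject values that cannot be a valid RBF kernel width before fitting.
  stopifnot(is.numeric(sigma), length(sigma) == 1, is.finite(sigma), sigma > 0)
  # A list-valued kpar is used as given instead of kpar = "automatic".
  kernlab::kkmeans(x, centers = centers, kernel = "rbfdot",
                   kpar = list(sigma = sigma))
}

# Only runs when kernlab and the question's `mymat` are available.
if (requireNamespace("kernlab", quietly = TRUE) && exists("mymat")) {
  # sigest() estimates from a random subsample, so its output moves with the seed.
  for (s in c(123, 3, 999)) {
    set.seed(s)
    print(kernlab::sigest(mymat, scaled = FALSE))
  }
  set.seed(123)
  out <- tryCatch(fit_fixed_sigma(mymat, centers = 10, sigma = 0.3),  # 0.3 is assumed
                  error = function(e) conditionMessage(e))
  print(out)  # either the fitted object or the error message
}
```

If the errors persist with sigma held fixed, the automatic estimate is probably not the only cause.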
In any case, a question to ask yourself is whether you want to rely on something that fails this inconsistently, produces pretty different results across random seeds, and leaves you unsure whether the algorithm is actually doing what it says even when it does run.
clusters=5, no error, set.seed(123)
set.seed(123)
#> Hyperparameter : sigma = 0.463522505156128
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.53045 -21.18700 187.8918
#> [2,] 17.16138 -24.59687 184.7860
#> [3,] 15.73436 -17.87491 191.2586
#> [4,] 15.63425 -16.63862 192.0088
#> [5,] 16.19467 -20.16442 189.1617
#>
#> Cluster size:
#> [1] 11 8 11 8 12
#>
#> Within-cluster sum of squares:
#> [1] 537972.8 386310.2 544994.1 391965.9 604386.9
clusters=5, no error, set.seed(3)
Works, but pretty different numbers of observations per cluster! Note the different hyperparameter.
#> Hyperparameter : sigma = 0.290281708176631
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.97636 -18.38464 190.5449
#> [2,] 16.24809 -20.10409 188.9572
#> [3,] 15.63660 -17.85633 191.5151
#> [4,] 17.06100 -22.70840 185.8834
#> [5,] 17.16138 -24.59687 184.7860
#>
#> Cluster size:
#> [1] 11 11 15 5 8
#>
#> Within-cluster sum of squares:
#> [1] 545547.7 538434.5 757947.0 236986.8 386310.2
clusters=5, no error, set.seed(999)
Works, but pretty different numbers of observations per cluster! Note the different hyperparameter again!
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.128189488632645
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.93157 -22.25171 186.4579
#> [2,] 15.45090 -15.99500 192.8452
#> [3,] 15.73677 -18.32277 191.0152
#> [4,] 17.16244 -24.44533 184.8376
#> [5,] 16.32218 -20.69291 188.5965
#>
#> Cluster size:
#> [1] 7 10 13 9 11
#>
#> Within-cluster sum of squares:
#> [1] 294630.1 457490.3 604486.8 441669.5 539478.6
clusters = 10, new error, set.seed(99)
New error.
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'affinMult' for signature '"rbfkernel", "numeric"'
clusters = 10, original error, set.seed(3)
Original error.
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
Not included: additional error with clusters = 10 (not finding all of the columns in the matrix) and successfully getting some clusters with clusters = 10.