juliacluster-analysis

Julia - AssertionError in K-medoids algorithm


I am using K-medoids to reduce my data (potential future scenarios of a stock's price) from 400 scenarios to n = 25. I computed the distance amtrix based on the Eucledian distance:

C = pairwise(Euclidean(), data', data')

where data are my 400 scenarios - the data file can be found at link. Then I run K-medoids:

kmedoids(C, n, init = :kmpp)

I get the error:

ERROR: AssertionError: !(isempty(grp))

Does any of you know what could be wrong? It seems to run fine in Python (from sklearn_extra.cluster import KMedoids), but I guess the two K-medoids functions are different. I haven't been able to find any documentation on why it fails..


Solution

  • (Example)

    (In general, please provide minimal complete workable example code demonstrating the problem so readers can quickly reproduce it by importing the same packages and reading into the same data object. I guessed the following.)

    import Clustering.kmedoids
    import Distances.Euclidean
    import Distances.pairwise
    import DelimitedFiles.readdlm
    
    data = readdlm("data.csv")
    

    (Now your code reproduces the error.)

    n = 25
    C = pairwise(Euclidean(), data', data')
    kmedoids(C, n, init = :kmpp)
    
    ERROR: AssertionError: !(isempty(grp))
    Stacktrace:
     [1] _find_medoid(dist::Matrix{Float64}, grp::Vector{Int64})
       @ Clustering C:\Users\...\.julia\packages\Clustering\JwhfU\src\kmedoids.jl:229
     [2] _kmedoids!(medoids::Vector{Int64}, dist::Matrix{Float64}, maxiter::Int64, tol::Float64, displevel::Int64)
       @ Clustering C:\Users\...\.julia\packages\Clustering\JwhfU\src\kmedoids.jl:148
     [3] kmedoids(dist::Matrix{Float64}, k::Int64; init::Symbol, maxiter::Int64, tol::Float64, display::Symbol)
       @ Clustering C:\Users\...\.julia\packages\Clustering\JwhfU\src\kmedoids.jl:77
     [4] top-level scope
       @ REPL[13]:1
    

    Answer

    Maybe not enough distinct values in the data to create 25 or more different non-empty clusters.

    Set(data)
    

    produces

    Set{Float64} with 24 elements:
      -26.917
      12.619999999999997
      12.811
      12.79
      -4.722000000000001
      -27.619000000000003
      12.771
      11.009999999999998
      12.800999999999995
      -27.649
      9.980999999999995
      12.780999999999999
      10.780000000000001
      12.820999999999998
      12.790999999999997
      -27.001
      -27.586000000000002
      10.210999999999999
      7.9809999999999945
      11.989999999999995
      -4.579000000000001
      7.9709999999999965
      9.190999999999995
      -11.868000000000002
    

    The ?kmediods REPL help documentation ends with a note on the algorithm.

    The function implements a K-means style algorithm instead of PAM (Partitioning Around Medoids). K-means style algorithm converges in fewer iterations, but was shown to produce worse (10-20% higher total costs) results (see e.g. Schubert & Rousseeuw (2019)).

    The Clustering.kmedoids documentation references

    Teitz, M.B. and Bart, P. (1968). Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph. Operations Research, 16(5), 955–961. doi:10.1287/opre.16.5.955

    Schubert, E. and Rousseeuw, P.J. (2019). Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS Algorithms. SISAP, 171-187. doi:10.1007/978-3-030-32047-8_16