Tags: python, pandas, scikit-learn, cluster-analysis

Divide data into clusters by a linear function


I have a number of rows, which form three noticeable lines on a graph.


Sample data

line_position,queue_number,real_seq
0,2280,41171
55,3375,24999
55,733,11506
45,3939,29185
80,1522,14121
70,1022,10953
15,4687,24235
55,2072,14898
55,1755,12913
75,2014,17938
50,2178,14281
5,5612,36370
0,5689,38861
5,8023,40942
65,2777,21954
15,7384,39900
30,5241,35130
40,3554,19147
20,6663,37397
5,5134,28694
5,5273,32029
65,514,12791
10,7560,39851
25,6450,36909
50,1130,27140
20,4430,23025
0,5685,37094
0,5949,40905
20,6842,37547
5,5278,31231
15,7367,39031
40,4340,31534
35,3680,19437
5,5236,30761
5,2104,29053
0,5947,40685
45,3128,17475
40,4386,31495
50,3922,31394
15,7307,38805
55,3403,26704
70,2604,20509
5,5574,34118
55,733,11668
20,6663,37223
25,6430,37171
55,1815,12632
60,3094,23472
30,5798,36262
30,5293,34687
20,6554,37454
35,4767,34735
40,4411,31716
30,5427,35581
40,3350,18316
50,1075,14794
85,948,13668
80,1601,16079
5,4868,26220
20,6554,37075
5,2100,33351
75,666,5799
50,980,15290
95,387,7418
30,1715,20606
15,1980,25981
35,4759,30730
20,4603,24254
5,5059,28033
5,5257,32243
45,1308,16861
0,5849,38680
85,414,6927
0,2148,35148
70,2551,21015
35,4581,32535
80,561,6001
0,5672,35715
5,5152,33120
35,4984,34437
55,3574,27528
35,3762,19995
30,5798,39146
0,5911,40312
85,387,5917
35,4581,35933
55,754,11654
40,3610,25147
0,2252,39270
5,2042,34883
0,6032,41330
80,1826,20158
30,4075,21742
10,7517,40283
45,3029,19383
30,4933,32675
40,1479,21945
10,4826,25687
25,6380,37256
75,364,8215

I need to divide these rows into three clusters. I've tried several clustering algorithms from sklearn.cluster (AgglomerativeClustering, Birch, DBSCAN, KMeans, MiniBatchKMeans, MeanShift), but, as expected, none of them divides the data the way I need.

Looking at the graph, the simplest approach seems to be to "draw" two lines that split the data into three clusters.

However, I have not found any ready-made tool that would let me do that. How can I achieve this? Is there a better way to divide my data into three clusters?


Solution

  • The first scatter plot produced by the code below shows your sample data after standardizing. The sample you shared is small, so none of the models tried below pick up on the three clusters. With a larger dataset, standardized features, and well-chosen tuning parameters, the clustering models should be able to separate them.

    Otherwise, you can plot the data, read the slopes and intercepts of two separating lines off the graph, and label each point according to which side of each line it falls on.


    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN
    from sklearn.cluster import SpectralClustering
    from sklearn.cluster import KMeans
    from sklearn.cluster import AgglomerativeClustering
    import matplotlib.pyplot as plt
    import numpy as np
    
    s = [[0,2280,41171],[55,3375,24999],[55,733,11506],[45,3939,29185],[80,1522,14121],[70,1022,10953],
         [15,4687,24235],[55,2072,14898],[55,1755,12913],[75,2014,17938],[50,2178,14281],[5,5612,36370],
         [0,5689,38861],[5,8023,40942],[65,2777,21954],[15,7384,39900],[30,5241,35130],[40,3554,19147],
         [20,6663,37397],[5,5134,28694],[5,5273,32029],[65,514,12791],[10,7560,39851],[25,6450,36909],
         [50,1130,27140],[20,4430,23025],[0,5685,37094],[0,5949,40905],[20,6842,37547],[5,5278,31231],
         [15,7367,39031],[40,4340,31534],[35,3680,19437],[5,5236,30761],[5,2104,29053],[0,5947,40685],
         [45,3128,17475],[40,4386,31495],[50,3922,31394],[15,7307,38805],[55,3403,26704],[70,2604,20509],
         [5,5574,34118],[55,733,11668],[20,6663,37223],[25,6430,37171],[55,1815,12632],[60,3094,23472],
         [30,5798,36262],[30,5293,34687],[20,6554,37454],[35,4767,34735],[40,4411,31716],[30,5427,35581],
         [40,3350,18316],[50,1075,14794],[85,948,13668],[80,1601,16079],[5,4868,26220],[20,6554,37075],
         [5,2100,33351],[75,666,5799],[50,980,15290],[95,387,7418],[30,1715,20606],[15,1980,25981],
         [35,4759,30730],[20,4603,24254],[5,5059,28033],[5,5257,32243],[45,1308,16861],[0,5849,38680],
         [85,414,6927],[0,2148,35148],[70,2551,21015],[35,4581,32535],[80,561,6001],[0,5672,35715],
         [5,5152,33120],[35,4984,34437],[55,3574,27528],[35,3762,19995],[30,5798,39146],[0,5911,40312],
         [85,387,5917],[35,4581,35933],[55,754,11654],[40,3610,25147],[0,2252,39270],[5,2042,34883],
         [0,6032,41330],[80,1826,20158],[30,4075,21742],[10,7517,40283],[45,3029,19383],[30,4933,32675],
         [40,1479,21945],[10,4826,25687],[25,6380,37256],[75,364,8215]]
    
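    # Standardize all columns, then keep line_position (col 0) and queue_number (col 1).
    X = StandardScaler().fit_transform(np.array(s))[:, :2]
    plt.scatter(X[:, 0], X[:, 1], s=20)
    plt.show()
    
    # Candidate clustering models; the parameters are starting points, not tuned values.
    models = (
        DBSCAN(eps=0.5, min_samples=2),
        SpectralClustering(n_clusters=3, assign_labels="discretize"),
        KMeans(n_clusters=3),
        AgglomerativeClustering(n_clusters=3),
        )
    
    # Fit each model and color the points by the cluster label it assigns.
    for m in models:
        m.fit(X)
        plt.scatter(X[:, 0], X[:, 1], s=20, c=m.labels_)
        plt.show()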
    

    Here's an example of separating the data by hand:

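    import pandas as pd

    df = pd.DataFrame(s, columns=["line_position", "queue_number", "real_seq"])
    # Two hand-picked separating lines: line_position = slope * queue_number + intercept.
    # Each point's label is the number of lines it lies above (0, 1, or 2).
    df["labels"] = (df.line_position > ((-90/4000) * df.queue_number) + 90).astype(int)
    df["labels"] += (df.line_position > ((-100/7000) * df.queue_number) + 100).astype(int)
    plt.scatter(df.queue_number, df.line_position, s=20, c=df.labels)
    plt.show()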
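
The by-hand separation above can also be written as a small reusable helper. This is only a sketch: `line_through` and `label_by_lines` are hypothetical names introduced here, not part of pandas or scikit-learn. It assumes each separating line is described by two points read off the graph, and that the lines do not cross within the range of the data. The two point pairs used below, (0, 90)-(4000, 0) and (0, 100)-(7000, 0), reproduce the slopes and intercepts of the two lines used above.

```python
import numpy as np


def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (x, y) points.

    Assumes the line is not vertical (x1 != x2).
    """
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def label_by_lines(x, y, lines):
    """For each point, count how many of the given lines it lies strictly above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.zeros(x.shape, dtype=int)
    for slope, intercept in lines:
        labels += (y > slope * x + intercept).astype(int)
    return labels


# Two separating lines, each given by two points read off the graph
# (x is queue_number, y is line_position).
lines = [
    line_through((0, 90), (4000, 0)),
    line_through((0, 100), (7000, 0)),
]

# Three rows taken from the sample data, one from each band.
queue_number = np.array([2280, 4687, 8023])
line_position = np.array([0, 15, 5])

labels = label_by_lines(queue_number, line_position, lines)
print(labels)  # [0 1 2]
```

With the full data, the same call would be `label_by_lines(df.queue_number, df.line_position, lines)`. Adjusting a boundary then only means changing the two points that define it.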