Tags: coordinates, cluster-analysis, dbscan

DBSCAN on 3D coordinates doesn't find clusters


I'm trying to cluster 1428 points stored in a 3D coordinates DataFrame. The clusters are relatively flat, elongated clouds. They are visually obvious clusters, so I was hoping to use unsupervised clustering that doesn't require specifying the expected number of clusters. KMeans requires the number of clusters and does not separate them properly (see the KMeans plot of the results).

The data looks as follows:

                 5             6         7
0      9207.495280  18922.083277  4932.864
1      5831.199280   3441.735280  5756.326
2      8985.735280  12511.719280  7099.844
3      8858.223280  28883.151280  5689.652
4      6801.399277   6468.759280  7142.524
...            ...           ...       ...
1423  10332.927277  22041.855280  5136.252
1424   6874.971277  12937.563277  5467.216
1425   8952.471280  28849.887280  5710.522
1426   7900.611277  19128.255280  4803.122
1427  10234.635277  18734.631280  5631.286

[1428 rows x 3 columns]

I was hoping DBSCAN would handle this data better. However, when I try the following (I played around with eps and min_samples, but without success):

from sklearn.cluster import DBSCAN

# X is the 1428 x 3 coordinates DataFrame shown above
dbscan = DBSCAN(eps=10, min_samples=50)
clusters = dbscan.fit_predict(X)

print('Clusters found', dbscan.labels_)
len(clusters)

I get this output:

Clusters found [-1 -1 -1 ... -1 -1 -1]

1428

I'm confused about how to get this to work, especially since KMeans did work:

import sklearn.cluster as sk_cluster

kmeans = sk_cluster.KMeans(init='k-means++', n_clusters=9, n_init=50)
kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
kmeans_labels = kmeans.labels_
error = kmeans.inertia_
print("The total error of the clustering is: ", error)
print('\nCluster labels')
print(kmeans_labels)

Output:

The total error of the clustering is:  4994508618.792263

Cluster labels
[8 0 7 ... 3 8 1]

Solution

  • Remember this golden rule:

    Always perform normalization on your data before feeding it to an ML / DL algorithm.

    The reason: your columns have very different ranges (one spans roughly [10000, 20000], another roughly [4000, 5000]), so the Euclidean distance between points is dominated by the widest column, and any distance-based clustering or classification breaks down. Scaling brings every column to the same range while preserving the relative distances within each feature, much like zooming in and out on Google Maps changes the scale without changing the geography. A short sketch of the effect follows below.

    You are free to choose the normalization algorithm; sklearn.preprocessing offers several (MinMaxScaler, StandardScaler, RobustScaler, among others).
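
    As a quick illustration, here is a minimal sketch with made-up toy values (not the asker's data) of how one wide-range column dominates the Euclidean distance, and how MinMaxScaler fixes that:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Toy points whose columns mimic the ranges in the question
    pts = np.array([[9000.0, 19000.0, 4900.0],
                    [9010.0, 12500.0, 4920.0],
                    [9005.0, 28900.0, 4890.0]])

    # The raw distance is thousands of units, driven almost entirely
    # by the second column -- hopeless against eps=10.
    print(np.linalg.norm(pts[0] - pts[1]))            # ~6500

    # After scaling, every column lies in [0, 1] and distances are of
    # order 1, so a small eps such as 0.05 becomes meaningful.
    pts_norm = MinMaxScaler().fit_transform(pts)
    print(np.linalg.norm(pts_norm[0] - pts_norm[1]))  # ~1.3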

    Edit:

    Use this code:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Scale every column to [0, 1] before clustering
    scaler = MinMaxScaler()
    scaler.fit(X)
    X_norm = scaler.transform(X)

    from sklearn.cluster import DBSCAN

    # Note the much smaller eps, matching the scaled [0, 1] coordinates
    dbscan = DBSCAN(eps=0.05, min_samples=3, leaf_size=30)
    clusters = dbscan.fit_predict(X_norm)

    np.unique(dbscan.labels_)
    
    
    array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
           16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
           33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47])
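
    That is 48 clusters plus the noise label -1. As a small follow-up (not in the original answer), np.unique with return_counts=True shows how the points are distributed across the clusters:

    labels, counts = np.unique(dbscan.labels_, return_counts=True)
    print(dict(zip(labels, counts)))   # points per cluster; key -1 = noise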
    
    

    Since DBSCAN is a density-based approach, I first tried sklearn's normalize (from sklearn.preprocessing import normalize), but it didn't work, and it shouldn't for DBSCAN: normalize rescales each sample (row) to unit norm instead of putting the features on a common scale, whereas DBSCAN needs every feature to contribute comparably to the distance.
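
    A small sketch (not in the original answer) of that difference: normalize rescales each row to unit norm, while MinMaxScaler rescales each column to [0, 1], which is what actually equalizes the feature ranges:

    import numpy as np
    from sklearn.preprocessing import normalize, MinMaxScaler

    pts = np.array([[9000.0, 19000.0, 4900.0],
                    [6000.0,  6500.0, 7100.0]])

    print(np.linalg.norm(normalize(pts), axis=1))  # [1. 1.] -- unit-norm rows
    print(MinMaxScaler().fit_transform(pts))       # each column spans [0, 1]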

    So I went with MinMaxScaler, which brings every feature into the same [0, 1] range. One thing to note: because the scaled data points are all less than 1, eps has to be chosen on that scale as well, which is why it drops from 10 to 0.05 above.
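
    If eps=0.05 needs further tuning, one common heuristic (a sketch, not part of the original answer) is the k-distance plot: sort each point's distance to its k-th nearest neighbour, with k equal to min_samples, and pick eps near the elbow of the curve:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    k = 3  # match min_samples
    dist, _ = NearestNeighbors(n_neighbors=k).fit(X_norm).kneighbors(X_norm)

    plt.plot(np.sort(dist[:, -1]))                       # k-distance curve
    plt.xlabel('points sorted by k-distance')
    plt.ylabel(f'distance to {k}-th nearest neighbour')
    plt.show()                                           # pick eps near the elbow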