pythoncluster-analysisdbscanoptics-algorithm

Clustering algorithm for snake like clusters


I'm searching for a good algorithm to identify data clusters where the clusters tend to be linear, sort of snake like clusters. I've tried a number of standard clustering algorithms like DBSCAN, OPTICS, HDBSCAN, and RobustSingleLinkage, but they all kind of look like the picture below where it gets all muddled up between the snake clusters and the regular clusters. Does anyone know a good clustering algorithm to solve this?

enter image description here

Anony-Mousse's answer was pretty helpful. I'm going to add a little detail to show how I applied it. I used the DBSCAN, adjusted the scale of the X axis and the DBSCAN eps value until it started picking up more of the horizontal clusters. This is pretty effective, close enough for my purposes.

scan = cluster.DBSCAN(eps=20, min_samples=10, metric="l1", n_jobs=-1)
X_val[:, 0] = X_val[:, 0]/20000
scan.fit(X_val)
y_pred = scan.labels_.astype(np.int) + 1
# y_pred = np.where(y_pred > 0, 1, 0)
plt.scatter(X.iloc[:, 0]/20000, X.iloc[:, 1], color=colors[y_pred])

enter image description here


Solution

  • Don't try to solve this by trial-and-error.

    Understand your problem, understand your data, then choose the algorithm.

    1. Your x axis appears to be a sequence number
    2. Your y axis appears to be a measurement

    Euclidean distance on (x,y) does not make much sense, does it?

    Instead, you'll want to have some thresholding. In fact, a variation of DBSCAN, known as Generalized DBSCAN, makes most sense on such data.

    You want points to be in a cluster if:

    1. They differ on the x axis by at most dx=100
    2. They differ on the y axis by at most dy=10
    3. There are at least 10 points there

    As you appear to be using python, for which I do not know any implementation of GeneralizedDBSCAN, you will have to "hack" DBSCAN into simulating this behavior. Try the following: scale the y axis by dx/dy (here: 10). Then try DBSCAN with radius eps=dx, min_samples=10 and Manhattan metric metric="l1". Since sklearn also does not have maximum norm, you could also rotate by 45 degrees and use a larger radius to get closer to what Generalized DBSCAN would give with above rules. But the most important thing is to adjust the weighting of the two features (do not use heuristic normalization, but prefer interpretable values based on the problem!)