I have a large set of 'Vehicle speed vs Engine RPM' values for a vehicle. I'm trying to estimate the time the vehicle spends in each gear.
I ran K-Means clustering on the dataset and got the following result:
Clearly, my algorithm has failed to capture the evident pattern. I want to force K-Means (or any other clustering algorithm, for that matter) to cluster the data along the six sloped lines. A snippet of the relevant code:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('speedRpm.csv')
print(data.shape)
data.head()
# Getting the data points
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))
# Number of clusters
k = 5
kmeans = KMeans(n_clusters=k)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
labeled_array = {i: X[np.where(kmeans.labels_ == i)] for i in range(kmeans.n_clusters)}
colors = ['r', 'g', 'b', 'y', 'c']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if kmeans.labels_[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='#050505')
plt.show()
How do I make sure the clustering algorithm captures the right pattern, even if it isn't the most efficient approach?
Thanks!
EDIT:
Ran the same set of points through DBSCAN this time. After playing around with the eps and min_samples values for some time, I got the following result:
Although it's still not perfect and there are way too many outliers, the algorithm is beginning to capture the linear trend.
Code:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('speedRpm.csv')
print(data.shape)
data.head()
# Getting the values and plotting it
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))
# DBSCAN
# Compute DBSCAN
db = DBSCAN(eps=1.1, min_samples=3).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print "Estimated Number of Clusters", n_clusters_
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
There are two major options here: use a clustering algorithm better suited to the structure of your data, or transform your data (or tweak k-means) so that k-means can capture it. I'll focus on the first.
Use a better-suited algorithm:
Scikit-learn has a good description of several clustering algorithms here. From the link, a (crudely cropped) helpful graphic:
This row looks similar to your dataset; have you tried a Gaussian mixture model? A GMM has few well-known theoretical properties, but it works by assigning each point a probability of belonging to each cluster center, based on a posterior calculated from the data. You can often initialize it with k-means, which sklearn does for you.
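As a minimal sketch of that idea, assuming the same X array built in your snippet (n_components=6, one per gear, and covariance_type='full' below are guesses you would want to tune):

from sklearn.mixture import GaussianMixture

# Assumes X is the (n_samples, 2) rpm/speed array from the question.
# n_components=6 is a guess (one component per gear); covariance_type='full'
# lets each component stretch along its own sloped direction.
# init_params defaults to 'kmeans', so sklearn handles the k-means initialization.
gmm = GaussianMixture(n_components=6, covariance_type='full', random_state=0)
gmm.fit(X)
gmm_labels = gmm.predict(X)

fig, ax = plt.subplots()
for i in range(6):
    pts = X[gmm_labels == i]
    ax.scatter(pts[:, 0], pts[:, 1], s=7)
plt.show()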
Similarly, density-based clustering algorithms (e.g. DBSCAN) seem like a logical choice. Your data segments nicely into dense clusters, and that seems like a good topological property to filter for. In the image on the linked Wikipedia page:
they offer the caption:
DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means
which seems to speak to your troubles.
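One note if you pursue DBSCAN: eps is a single radius measured in the raw feature units, and rpm and speed live on very different scales, so it often helps to standardize first. A minimal sketch, assuming the same X array; the eps and min_samples values below are only starting points to tune:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Assumes X is the (n_samples, 2) rpm/speed array from the question.
# Scaling puts rpm and speed on comparable scales so one eps radius
# is meaningful in both directions; eps and min_samples need tuning.
X_scaled = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.05, min_samples=5).fit(X_scaled)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters:", n_clusters)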
K-means is an extremely versatile algorithm, but it is not globally optimal and suffers from a number of weak points. Here is some dense reading on the subject.
In addition to problems like the mickey mouse problem, k-means typically minimizes simple Euclidean distance to the centroids. While that makes sense for many problems, it doesn't in yours, where the skew of the clusters means that Euclidean distance isn't quite the right measure. Notice that other algorithms that use similar measures, such as the agglomerative/hierarchical clustering shown above, run into similar pitfalls.
I haven't covered transforming your data or tweaking k-means, because the latter requires hacking into (or writing your own) clustering algorithm (which I don't recommend for a simple exploratory problem, given the coverage of sklearn and similar packages), while the former seems like a local solution that is sensitive to your exact data. ICA might be a decent start, but there are a lot of options for that task.
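If you do want to experiment with that route, here is a minimal sketch of the ICA idea, assuming the same X array; it uses FastICA from sklearn.decomposition followed by k-means on the unmixed coordinates, purely as an illustration, and whether it actually separates the gear lines depends entirely on your data:

from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

# Assumes X is the (n_samples, 2) rpm/speed array from the question.
# FastICA looks for an unmixing of the two features; k-means is then
# run in the transformed space. n_clusters=6 is a guess (one per gear).
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)
ica_labels = KMeans(n_clusters=6, random_state=0).fit_predict(X_ica)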