Say i have the following dataframe stored as a variable called coordinates, where the first few rows look like:
business_lat business_lng business_rating
0 19.111841 72.910729 5.
1 19.111342 72.908387 5.
2 19.111342 72.908387 4.
3 19.137815 72.914085 5.
4 19.119677 72.905081 2.
5 19.119677 72.905081 2.
. . .
. . .
. . .
As you can see this data is geospatial (has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the latlng in that row. I want to cluster the data, where businesses that are nearby and have similar ratings are assigned into the same cluster. Essentially I need a a geospatial cluster with an additional requirement that the clustering must consider the rating column.
I've looked online and can't really find much addressing approaches for this: only things for strict geospatial clustering (only features to cluster on are latlng) or non spatial clustering.
I have a simple DBSCAN running below, but when i plot the results of the clustering it does not seem to be doing what I want correctly.
from sklearn.cluster import DBSCAN
import numpy as np
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data or using a different approach all together?
The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. It's simple to ask when it is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within a domain and then determine some interaction between the domains. This could be done using scaling options like MinMaxScaler mentioned, however, I think this is a bit heavy handed and we could use our knowledge of the domains to cluster better.
Location distance is best handled directly as this has real world meaning that we can precalculate distances for. The meaning of meters apart is direct to what we
You could use the scaling option mentioned in the previous answer but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat lon themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
We can think of the problem through asking a couple of questions that relate rating to distance. We could ask, how different must ratings be to separate observations in the same place? What is the meter difference to rating difference ratio? With an idea of ratio, we can calculate another distance matrix for the rating difference for all observations and use this to scale or add on the original location distance matrix or we could increase the distance for every gap in rating. This location-plus-ratings-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation forming clusters that either blend ratings together because the penalty is not high enough or separates into single rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1,4) for _ in range(len(X)//2)] \
+[random.randint(2,5) for _ in range(len(X)//2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
rating_addition = (rating_distances ** 2) * multiplier
distance_matrix = rating_addition + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
clustering.fit(distance_matrix)
return clustering.labels_