python · scikit-learn · cluster-analysis · geospatial · dbscan

Clustering geospatial data on coordinates AND a non-spatial feature


Say I have the following dataframe stored in a variable called coordinates, where the first few rows look like:

   business_lat  business_lng  business_rating
0   19.111841     72.910729           5.
1   19.111342     72.908387           5.
2   19.111342     72.908387           4.
3   19.137815     72.914085           5.
4   19.119677     72.905081           2.
5   19.119677     72.905081           2.
        .             .               .
        .             .               .
        .             .               .

As you can see, this data is geospatial (has a lat and a lng) AND every row has an additional value, business_rating, that corresponds to the rating of the business at the lat/lng in that row. I want to cluster the data so that businesses that are nearby and have similar ratings are assigned to the same cluster. Essentially I need geospatial clustering with the additional requirement that the clustering must consider the rating column.

I've looked online and can't really find much addressing approaches for this: only things for strictly geospatial clustering (where the only features to cluster on are lat/lng) or non-spatial clustering.

I have a simple DBSCAN running below, but when I plot the results of the clustering it does not seem to be doing what I want.

from sklearn.cluster import DBSCAN
import numpy as np

# eps is 2 km expressed in radians (Earth radius ~6371 km)
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(
    np.radians(coordinates[['business_lat', 'business_lng']]))

Would I be better served trying to tweak the parameters of the DBSCAN, doing some additional processing of the data, or using a different approach altogether?


Solution

  • The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. The question is simple when there is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within each domain and then determine some interaction between the domains. This could be done using scaling options like the MinMaxScaler mentioned in the other answer; however, I think this is a bit heavy-handed, and we can use our knowledge of the domains to cluster better.
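
    For contrast, a minimal sketch of that heavy-handed scaling approach, assuming the question's three columns live in a dataframe df (as in the rest of this answer) and with a purely illustrative eps:

    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import MinMaxScaler

    # Squash all three columns into [0, 1] and cluster on them directly.
    # This implicitly treats a span of latitude and a span of rating as
    # equivalent, which is the assumption the rest of this answer avoids.
    scaled = MinMaxScaler().fit_transform(
        df[['business_lat', 'business_lng', 'business_rating']])
    naive_labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(scaled)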

    Handling Location

    Location distance is best handled directly, as it has a real-world meaning for which we can precalculate distances. A gap measured in metres or kilometres maps directly onto what we care about: how physically close two businesses are.

    You could use the scaling option mentioned in the previous answer, but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaler would give more importance to variation on the thin axis than on the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the latitudes and longitudes themselves.

    import numpy as np
    from sklearn.metrics.pairwise import haversine_distances


    # haversine_distances expects radians and returns distances on the unit
    # sphere; multiplying by the Earth's radius (~6371 km) gives kilometres
    points_in_radians = df[['business_lat', 'business_lng']].apply(np.radians).values
    distances_in_km = haversine_distances(points_in_radians) * 6371
    

    Adding in Rating

    We can think about the problem by asking a couple of questions that relate rating to distance. How different must ratings be to separate observations in the same place? In other words, what is the ratio of metre difference to rating difference? With an idea of that ratio, we can calculate another distance matrix holding the rating difference between all observations, and use it to scale, or simply add to, the original location distance matrix; equivalently, we increase the distance for every gap in rating. This location-plus-rating-difference matrix can then be clustered on.

    from sklearn.metrics.pairwise import euclidean_distances

    # each point of rating difference is treated as this many extra km
    added_km_per_rating_gap = 1
    rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
    

    We can then simply add these together and cluster on the resulting matrix.

    from sklearn.cluster import DBSCAN

    # the combined matrix is still in km, so eps=1 means "reachable within
    # 1 km once the rating penalty has been added on"
    distance_matrix = rating_distances + distances_in_km

    clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
    clustering.fit(distance_matrix)
    

    What we have done is cluster by location while adding a penalty for rating difference. Making that penalty direct and controllable allows for optimisation to find the best clustering, for example along the lines sketched below.
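
    A rough sketch of that optimisation, reusing df, distances_in_km, euclidean_distances and DBSCAN from above; the parameter grids and the silhouette criterion are assumptions for illustration, not recommendations:

    from sklearn.metrics import silhouette_score

    best = None
    for added_km_per_rating_gap in (0.5, 1, 2):
        rating_penalty = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
        combined = rating_penalty + distances_in_km
        for eps in (0.5, 1, 2):
            labels = DBSCAN(metric='precomputed', eps=eps, min_samples=2).fit_predict(combined)
            if len(set(labels)) > 1:  # silhouette needs at least two labels
                score = silhouette_score(combined, labels, metric='precomputed')
                if best is None or score > best[0]:
                    best = (score, added_km_per_rating_gap, eps)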

    Testing

    The problem I'm finding is that (with my test data, at least) DBSCAN has a tendency to 'walk' from observation to observation, forming clusters that either blend ratings together, because the penalty is not high enough, or separate into single-rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
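
    One such alternative, offered only as a sketch, is agglomerative clustering on the same precomputed matrix, since average linkage is less prone to the chaining behaviour described above; the 2 km cut-off is illustrative, and on scikit-learn versions before 1.2 the metric argument is called affinity:

    from sklearn.cluster import AgglomerativeClustering

    # cut the dendrogram once the average combined distance between groups
    # exceeds 2 (km plus rating penalty), rather than growing by reachability
    agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2,
                                  metric='precomputed', linkage='average')
    labels = agg.fit_predict(distance_matrix)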

    Here is the code I used to test. I used the square of the rating distance to emphasise larger gaps.

    import random

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.metrics.pairwise import euclidean_distances

    # synthetic locations: six 2D blobs stand in for lat/lng
    X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
    # ratings drawn from overlapping ranges: 1-4 for the first half of the
    # rows, 2-5 for the second half
    ratings = np.array([random.randint(1, 4) for _ in range(len(X) // 2)]
                       + [random.randint(2, 5) for _ in range(len(X) // 2)]).reshape(-1, 1)

    distances_in_km = euclidean_distances(X)
    rating_distances = euclidean_distances(ratings)


    def build_clusters(multiplier, eps):
        # square the rating distance to emphasise larger gaps, then scale it
        rating_addition = (rating_distances ** 2) * multiplier
        distance_matrix = rating_addition + distances_in_km
        clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
        clustering.fit(distance_matrix)
        return clustering.labels_
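
    And a hypothetical way to exercise it, with multiplier and eps values chosen only for illustration:

    # sweep a few penalty multipliers and inspect the resulting label sets
    for multiplier in (0.25, 0.5, 1):
        labels = build_clusters(multiplier=multiplier, eps=1)
        print(multiplier, np.unique(labels))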