machine-learning cluster-analysis data-science yelp

Which clustering model can I use to predict the following outcome?


I have a dataset of restaurants in the 'pizza' category, derived from the Yelp dataset. Each restaurant has three columns: latitude, longitude, and checkins. I need to build a model that predicts the coordinates (latitude, longitude) where I should open a new restaurant so that its number of check-ins is high. There are 4951 rows in total.

    checkins   latitude   longitude
0            2  33.394877 -111.600194
1            2  43.841217  -79.303936
2            1  40.442828  -80.186293
3            1  41.141631  -81.356603
4            1  40.434399  -79.922983
5            1  33.552870 -112.133712
6            1  43.686836  -79.293838
7            2  41.131282  -81.490180
8            1  40.500796  -79.943429
9           12  36.010086 -115.118656
10           2  41.484475  -81.921150
11           1  43.842450  -79.027990
12           1  43.724840  -79.289919
13           2  45.448630  -73.608719
14           1  45.577027  -73.330855
15           1  36.238059 -115.210341
16           1  33.623055 -112.339758
17           1  43.762768  -79.491417
18           1  43.708415  -79.475884
19           1  45.588257  -73.428926
20           4  41.152875  -81.358754
21           1  41.608833  -81.525020
22           1  41.425152  -81.896178
23           1  43.694716  -79.304879
24           1  40.442147  -79.956513
25           1  41.336466  -81.784790
26           1  33.231942 -111.721218
27           2  36.291436 -115.287016
28           2  33.641847 -111.995571
29           1  43.570217  -79.566431
...        ...        ...         ...

I tried clustering with DBSCAN and ended up with the graph below, but I can't make any sense of it. How do I proceed, or how should I approach the problem differently to get my results?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Load the pizza restaurants and the Yelp check-in counts.
review = pd.read_csv('pizza_category.csv')
checkin = pd.read_csv('yelp_academic_dataset/yelp_checkin.csv')

# Join on business_id and drop incomplete rows.
final = pd.merge(review, checkin, on='business_id', how='inner')
final = final.dropna()  # dropna() returns a new frame, so the result must be assigned
final = final.reset_index(drop=True)

# Feature matrix: column 0 = checkins, column 1 = latitude, column 2 = longitude.
X = final[['checkins', 'latitude', 'longitude']].astype(np.float64)
print(X)
arr = X.values

# Cluster on all three columns together.
db = DBSCAN(eps=2, min_samples=5)
y_pred = db.fit_predict(arr)

# Plot column 0 (checkins) against column 1 (latitude), coloured by cluster label.
plt.figure(figsize=(20, 10))
plt.scatter(arr[:, 0], arr[:, 1], c=y_pred, cmap="plasma")
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

Here's the plot I got: [image]


Solution

  • This is not a clustering problem.

    What you want to do is density estimation, where you estimate the density of check-ins over (latitude, longitude) from the check-in frequencies of the existing restaurants.
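
One possible way to sketch this idea, not necessarily what the answer had in mind, is a kernel density estimate over the coordinates in which each restaurant is weighted by its check-in count. The example below assumes scikit-learn's `KernelDensity` with the haversine metric (available with the ball tree) and `sample_weight` support in `fit`. The function names, the `bandwidth_km` parameter, the 2 km default, and the search box are hypothetical choices for illustration. The three sample rows are the Las Vegas area entries (rows 9, 15 and 27) from the table in the question.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

EARTH_RADIUS_KM = 6371.0


def fit_checkin_density(lat, lon, checkins, bandwidth_km=2.0):
    """Fit a check-in-weighted kernel density estimate over (lat, lon).

    bandwidth_km is an assumed smoothing radius; it needs tuning on real data.
    """
    # The haversine metric expects [latitude, longitude] in radians.
    coords = np.radians(np.column_stack([lat, lon]))
    kde = KernelDensity(
        bandwidth=bandwidth_km / EARTH_RADIUS_KM,  # km converted to radians
        metric="haversine",
        kernel="gaussian",
        algorithm="ball_tree",
    )
    # Each restaurant contributes in proportion to its check-in count.
    kde.fit(coords, sample_weight=np.asarray(checkins, dtype=float))
    return kde


def densest_grid_point(kde, lat_range, lon_range, grid_size=100):
    """Evaluate the density on a regular grid and return the best (lat, lon)."""
    lats = np.linspace(*lat_range, grid_size)
    lons = np.linspace(*lon_range, grid_size)
    grid_lat, grid_lon = np.meshgrid(lats, lons, indexing="ij")
    grid = np.radians(np.column_stack([grid_lat.ravel(), grid_lon.ravel()]))
    # score_samples returns log-density; only the ranking matters here.
    log_density = kde.score_samples(grid)
    best = np.argmax(log_density)
    return grid_lat.ravel()[best], grid_lon.ravel()[best]


# Three rows from the question's table (Las Vegas area).
lat = [36.010086, 36.238059, 36.291436]
lon = [-115.118656, -115.210341, -115.287016]
checkins = [12, 1, 2]

kde = fit_checkin_density(lat, lon, checkins)
best_lat, best_lon = densest_grid_point(kde, (35.9, 36.4), (-115.4, -115.0))
print(best_lat, best_lon)
```

Because the data covers several cities, it probably makes sense to run the grid search one city at a time rather than over the whole bounding box. Note also that a weighted density is high both where there are many restaurants and where individual restaurants get many check-ins, so a high value does not by itself say that a new restaurant at that spot would do well.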