I have data that resembles this:
import pandas as pd
import random
random.seed(901)
rand_list1= []
rand_list2= []
rand_list3= []
rand_list4= []
rand_list5= []
for i in range(20):
    x = random.randint(80, 1000)
    rand_list1.append(x / 100)
    y1 = random.randint(-200, 200)
    rand_list2.append(y1 / 10)
    y2 = random.randint(-200, 200)
    rand_list3.append(y2 / 10)
    y3 = random.randint(-200, 200)
    rand_list4.append(y3 / 10)
    y4 = random.randint(-200, 200)
    rand_list5.append(y4 / 10)
df = pd.DataFrame({'Rainfall Recorded':rand_list1, 'TAXI A':rand_list2, 'TAXI B':rand_list3, 'TAXI C':rand_list4, 'TAXI D':rand_list5})
df.head()
   Rainfall Recorded  TAXI A  TAXI B  TAXI C  TAXI D
0               5.21    13.7    -5.0   -14.2     9.8
1               2.39    -0.3    18.8     4.8    -6.4
2               8.09    15.0    -3.6    18.6    12.7
3               5.79    -0.2    14.6     0.9     3.8
4               7.48    10.9     9.0    15.4   -16.5
Each row holds the rainfall recorded in our region (in centimeters) and the % change in earnings reported by each of the taxi drivers surveyed. Can I use k-means clustering
to determine whether the taxis operated in our locality or not? Assume there is a relationship between the rainfall recorded and the change in earnings.
I have some simple code from a web source:
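from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
# 'TAXI' is a placeholder here; df only has the columns 'TAXI A' to 'TAXI D'
y_predicted = km.fit_predict(df[['TAXI', 'Rainfall Recorded']])
y_predicted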
But I am unsure what transformations need to be done before using this code.
import pandas as pd
from sklearn.cluster import KMeans
import numpy as np
import random
random.seed(901)
rand_list1 = []
rand_list2 = []
rand_list3 = []
rand_list4 = []
rand_list5 = []
for i in range(20):
    x = random.randint(80, 1000)
    rand_list1.append(x / 100)
    y1 = random.randint(-200, 200)
    rand_list2.append(y1 / 10)
    y2 = random.randint(-200, 200)
    rand_list3.append(y2 / 10)
    y3 = random.randint(-200, 200)
    rand_list4.append(y3 / 10)
    y4 = random.randint(-200, 200)
    rand_list5.append(y4 / 10)
df = pd.DataFrame({
    'Rainfall Recorded': rand_list1,
    'TAXI A': rand_list2,
    'TAXI B': rand_list3,
    'TAXI C': rand_list4,
    'TAXI D': rand_list5
})
# Number of clusters
k = 2
# Function to apply k-means to each row
def cluster_row(row, n_clusters):
    # Extract the taxi data as a column vector: one sample per taxi, one feature
    taxi_data = row[['TAXI A', 'TAXI B', 'TAXI C', 'TAXI D']].values.reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans.fit(taxi_data)
    return kmeans.labels_
# Apply the function to each row and store the cluster labels
df['Taxi Clusters'] = df.apply(lambda row: cluster_row(row, k), axis=1)
print(df)
This adds a 'Taxi Clusters' column that holds, for each row (each rainfall reading recorded in our locality), an array of cluster labels, one per taxi.
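On the question of which transformations to apply first: one possible approach, sketched here under assumptions, is to reshape the table so that each (rainfall, earnings change) pair is one sample, then standardise both features, because k-means relies on Euclidean distance and the two columns are on different scales. The long format, the use of StandardScaler, the names long_df, 'Taxi', 'Earnings Change' and 'Cluster', and n_clusters=2 are illustrative choices, not something the original code prescribes.

```python
import random

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild a frame shaped like the one in the question (same seed and ranges).
random.seed(901)
rows = []
for _ in range(20):
    rain = random.randint(80, 1000) / 100
    changes = [random.randint(-200, 200) / 10 for _ in range(4)]
    rows.append([rain] + changes)

taxi_cols = ['TAXI A', 'TAXI B', 'TAXI C', 'TAXI D']
df = pd.DataFrame(rows, columns=['Rainfall Recorded'] + taxi_cols)

# Long format: one row per (rainfall reading, taxi) pair.
long_df = df.melt(id_vars='Rainfall Recorded', value_vars=taxi_cols,
                  var_name='Taxi', value_name='Earnings Change')

# Standardise both features so neither dominates the distance computation.
features = StandardScaler().fit_transform(
    long_df[['Rainfall Recorded', 'Earnings Change']])

# Assumption: two clusters, as in the question's snippet.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
long_df['Cluster'] = km.fit_predict(features)

print(long_df.head())
```

Note that the cluster labels 0 and 1 are arbitrary. K-means is unsupervised, so the labels alone do not say which group "operated locally"; that interpretation would have to come from inspecting the cluster centres in km.cluster_centers_ or from known examples.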