I want to calculate the row-wise cosine similarity between every pair of consecutive rows. The dataframe is already sorted on id and date.
I tried looking at the solutions here on Stack Overflow, but the use case seems to be a bit different in each of them. I have many more features, around 32 in total, and I want to consider all of those feature columns (Paths modified, tags modified and endpoints added in the df above are examples of some features) and calculate the distance metric for each row.
This is what I could think of, but it does not fulfil the purpose:
df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=['id', 'date', 'feature1', 'feature2', 'feature3'])
similarity_df = df.iloc[:, 2:].apply(lambda x: cosine_similarity([x], df.iloc[:, 2:])[0], axis=1)
Does anyone have suggestions on how I could proceed with this?
I was able to figure it out. The loop is what I was looking for, since some of the api_spec_id values were not getting assigned NaN and a distance was being calculated for them, which is wrong.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Feature columns to use for cosine similarity calculation
cols_to_use = labels.loc[:, "Info_contact_name_changes":"Paths_modified"].columns

# New column for cosine similarity
labels['cosine_sim'] = np.nan

# Looping through each api_spec_id
for api_spec_id in labels['api_spec_id'].unique():
    # Get the rows for the current api_spec_id
    api_rows = labels[labels['api_spec_id'] == api_spec_id].sort_values(by='commit_date')

    # Set the cosine similarity of the first row to NaN, since there is no previous row to compare to
    labels.loc[api_rows.index[0], 'cosine_sim'] = np.nan

    # Calculate the cosine similarity for consecutive rows
    for i in range(1, len(api_rows)):
        # Get the previous and current row
        prev_row = api_rows.iloc[i-1][cols_to_use]
        curr_row = api_rows.iloc[i][cols_to_use]

        # Calculate the cosine similarity and store it in the 'cosine_sim' column
        cosine_sim = cosine_similarity([prev_row], [curr_row])[0][0]
        labels.loc[api_rows.index[i], 'cosine_sim'] = cosine_sim
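
If the nested loop gets slow on a larger frame, the same idea can probably be expressed without explicit loops by shifting the feature columns within each group. The following is only a sketch, not a drop-in replacement: the helper name consecutive_cosine_sim and the small two-feature frame are made up for illustration, it assumes the index is unique, and it computes the cosine similarity directly with pandas/NumPy instead of calling sklearn (so a row of all zeros gives NaN here).

```python
import numpy as np
import pandas as pd


def consecutive_cosine_sim(df, group_col, sort_col, feature_cols):
    """Cosine similarity between each row and the previous row of its group.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    ordered = df.sort_values([group_col, sort_col])
    curr = ordered[feature_cols].astype(float)
    # Previous row within each group; the first row of every group becomes NaN
    prev = curr.groupby(ordered[group_col]).shift(1)
    # skipna=False keeps NaN for rows that have no previous row
    dot = (curr * prev).sum(axis=1, skipna=False)
    norms = np.sqrt((curr ** 2).sum(axis=1)) * np.sqrt((prev ** 2).sum(axis=1, skipna=False))
    # Return the result in the original row order
    return (dot / norms).reindex(df.index)


# Made-up toy data, only to show the shape of the result
labels = pd.DataFrame({
    "api_spec_id": [1, 1, 1, 2, 2],
    "commit_date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-01", "2023-01-02"]
    ),
    "feature1": [1, 2, 0, 1, 1],
    "feature2": [0, 0, 3, 1, 0],
})
labels["cosine_sim"] = consecutive_cosine_sim(
    labels, "api_spec_id", "commit_date", ["feature1", "feature2"]
)
print(labels)
```

The first row of each api_spec_id stays NaN because shift(1) is applied per group, so no similarity is ever computed across two different ids.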