I'm building a book recommendation system and want to use the KNN algorithm for collaborative filtering. I think I understand how KNN works, and I want to use the item-based approach, which requires computing the similarity between item vectors. However, the similarity computed by the library differs from the one I calculated myself, and I'm not sure why. Can you help me out?
import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans
# Create the DataFrame
ratings_dict = {
"item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
"user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
"rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)
# Convert to the dataset format used by the Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)
# Compute the similarity matrix (item-based)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)
similarity_matrix = algo.compute_similarities()
print(similarity_matrix)
This code prints:
[[1.         0.96954671]
 [0.96954671 1.        ]]
For reference, the ratings laid out as a user-by-item table:
item    1    2
user
A     1.0  2.0
B     2.0  4.0
C     2.5  4.0
D     4.5  5.0
E     3.0  NaN
But when I compute the cosine similarity myself, with the missing rating set to 0:
import numpy as np
# Define the two vectors
vector1 = np.array([1, 2, 2.5, 4.5, 3])
vector2 = np.array([2, 4, 4, 5, 0])
# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)
This code prints:
0.8550598237348973
I think the Surprise library filled the NaN value with something other than 0. I expected it to use 0, but it seems another value was used instead.
I tried ChatGPT, but it couldn't help me solve the issue.
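import numpy as np
# Keep only the users who rated both items (A to D); user E is dropped
vector1 = np.array([1, 2, 2.5, 4.5])
vector2 = np.array([2, 4, 4, 5])
# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)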
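The first part of your code (the Surprise version) just calculates the cosine similarity of these 4-dimensional vectors. Surprise's cosine similarity only takes into account users who rated both items, so user E's pair of values, one of which is NaN, is omitted rather than filled with 0 or anything else. The snippet above gives 0.96954671, which matches the library's output.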
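To make the "common users only" behaviour explicit, here is a minimal NumPy sketch that masks out any user with a missing rating before computing the cosine. The helper name `cosine_on_common` is hypothetical, not part of Surprise, and this is only an illustration of the idea, not the library's actual implementation.

```python
import numpy as np

def cosine_on_common(u, v):
    """Cosine similarity using only positions where both vectors have a rating.

    Hypothetical helper for illustration; NaN marks a missing rating.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Keep only the users who rated both items
    mask = ~np.isnan(u) & ~np.isnan(v)
    if not mask.any():
        return 0.0  # assumption: no common users means similarity 0
    u_c, v_c = u[mask], v[mask]
    denom = np.linalg.norm(u_c) * np.linalg.norm(v_c)
    return float(np.dot(u_c, v_c) / denom) if denom else 0.0

# Ratings for users A..E; user E has not rated item 2
item1 = [1, 2, 2.5, 4.5, 3]
item2 = [2, 4, 4, 5, np.nan]

sim = cosine_on_common(item1, item2)
print(sim)
```

With the data from the question this should reproduce the value Surprise reports (about 0.9695), whereas replacing the NaN with 0 keeps user E in the norm of item 1 and pulls the result down to about 0.855.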