python scikit-learn collaborative-filtering recommendation-engine

Basic filtering of data based on user & item in Python SciKit


I am trying to implement a recommender system that suggests items to users based on their ratings, which I think is the most common kind. After a lot of reading I shortlisted Surprise, a Python scikit for recommender systems.

While I am able to import the data and run a prediction, the result is not exactly what I want.

Right now what I have: I can pass a user_id, item_id and rating and get the probability of that user giving the rating I passed.

What I really want to do: Pass a user_id and in return get a list of items that would be potentially liked/rated highly by that user based on the data.

from surprise import Reader, Dataset
from surprise import SVD

# Define the format
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the data from the file using the reader format
data = Dataset.load_from_file('./data/ecomm/e.data', reader=reader)

algo = SVD()

# Build the trainset from the full dataset and fit the model on it.
trainset = data.build_full_trainset()
algo.fit(trainset)

# Inputs are: user_id, item_id and (optionally) the known rating r_ui.
# Raw ids loaded from a file are strings, so they are passed as strings here.
print(algo.predict('3', '107', r_ui=1))

Sample lines from data file.

The first column is the user_id, the second is the item_id, the third is the rating and the fourth is the timestamp.

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013

Solution

  • You need to iterate through all possible item_id values for a single user_id and predict the rating for each one. Then you collect the highest-rated items to recommend to that user.

    But make sure that the (user_id, item_id) pair is not already in the training dataset. Surprise has a function that does something like this:

    build_anti_testset

    Return a list of ratings that can be used as a testset in the test() method.

    The ratings are all the ratings that are not in the trainset, i.e. all the ratings r_ui where the user u is known, the item i is known, but the rating r_ui is not in the trainset. As r_ui is unknown, it is either replaced by the fill value or assumed to be equal to the mean of all ratings global_mean.

    After that you can pass these pairs to the test() or predict() method and collect the ratings, and get the top N recommendations from this data for a particular user.

    An example of this is given here:
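
Separately from the linked example, the steps described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names get_top_n and recommend_for_user are made up for illustration, and the code assumes that algo is a fitted Surprise algorithm and trainset is the Trainset it was fitted on, as in the question. It relies only on Prediction objects unpacking as (uid, iid, r_ui, est, details).

```python
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Group predictions by user and keep the n highest estimated ratings.

    `predictions` is any iterable of 5-tuples (uid, iid, r_ui, est, details),
    which is the shape of the Prediction objects returned by Surprise.
    Returns a dict mapping each user id to a list of (item id, estimate).
    """
    top_n = defaultdict(list)
    for uid, iid, _r_ui, est, _details in predictions:
        top_n[uid].append((iid, est))

    # Sort each user's candidates by estimated rating, best first.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


def recommend_for_user(algo, trainset, raw_uid, n=10):
    """Predict every item the user has not rated yet and return the top n.

    Hypothetical helper: `algo` is assumed to be fitted on `trainset`.
    Raises ValueError (from to_inner_uid) if the user is not in the trainset.
    """
    inner_uid = trainset.to_inner_uid(raw_uid)
    # trainset.ur maps an inner user id to a list of (inner item id, rating).
    already_rated = {inner_iid for inner_iid, _ in trainset.ur[inner_uid]}

    predictions = [
        algo.predict(raw_uid, trainset.to_raw_iid(inner_iid))
        for inner_iid in trainset.all_items()
        if inner_iid not in already_rated
    ]
    return get_top_n(predictions, n)[raw_uid]
```

With the objects from the question, recommend_for_user(algo, trainset, '196', n=10) would return up to ten (item_id, estimated_rating) pairs for user 196. The id is passed as a string because raw ids loaded with Dataset.load_from_file are strings. To get recommendations for every user at once, the output of algo.test(trainset.build_anti_testset()) can be passed straight to get_top_n, although the anti-testset can become very large on big datasets.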