[SOLVED] Binarize the ratings - MovieLens dataset

I am working on a personalised news recommendation engine based on users' click behaviour. My features will be predefined news categories (such as politics, sport, etc.).

Whenever a user clicks on an article, I build or update the user's profile based on that article, then recommend another article from the article pool.

To evaluate this system, I need a dataset of binary user-item interactions (whether the user clicked on the recommended article or not), but I couldn't find an appropriate dataset for this specific context. What I'm trying to do instead is binarize the MovieLens dataset and then calculate precision and recall.

What I actually do with the MovieLens dataset is as follows: if a user's rating for an item is larger than that user's average rating, I assign it a binary rating of 1, and 0 otherwise.
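The rule above can be sketched in Python roughly as follows. This is only a minimal illustration under assumptions: the function name `binarize_by_user_mean` and the `(user_id, item_id, rating)` tuple layout are made up for the example and are not part of MovieLens or any library.

```python
from collections import defaultdict


def binarize_by_user_mean(ratings):
    """Map each (user, item, rating) to (user, item, 0 or 1).

    A rating becomes 1 only if it is strictly larger than the mean of
    all ratings given by the same user. The tuple layout is an
    assumption made for this sketch.
    """
    ratings = list(ratings)

    # First pass: accumulate per-user sum and count to get the mean.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, _item, rating in ratings:
        totals[user] += rating
        counts[user] += 1
    means = {user: totals[user] / counts[user] for user in totals}

    # Second pass: compare each rating with its user's mean.
    return [
        (user, item, 1 if rating > means[user] else 0)
        for user, item, rating in ratings
    ]


# Toy data, invented for illustration only.
sample = [
    ("u1", "m1", 5), ("u1", "m2", 3), ("u1", "m3", 1),
    ("u2", "m1", 4), ("u2", "m2", 4),
]
binary = binarize_by_user_mean(sample)
```

Note that with a strict "larger than" comparison, a user who gives every item the same rating (like `u2` in the toy data) ends up with only zeros.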

Is this approach the right way to evaluate this kind of system?

Solution

Binarizing makes no difference. Precision and recall are relative measures, so the fact that a user rated an item is all you need. The rule used to decide what counts as a "good" rating is meaningless for testing purposes.
Epinions has two datasets: one of ratings, the other of binary trust relations.
Use MAP@k (mean average precision at k) for some number k of recommendations. This takes into account the ranking within a group of recommendations, which is no doubt how they will be used.
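For reference, MAP@k could be computed along these lines. This is a hedged sketch, not the code of any particular library: the function names and the dictionary layout are assumptions, each recommendation list is assumed to contain no duplicates, and normalizing by `min(len(relevant), k)` is one common convention among several.

```python
def average_precision_at_k(recommended, relevant, k):
    """Average precision over the top-k items of one ranked list.

    `recommended` is a ranked list of item ids (assumed duplicate-free);
    `relevant` is the set of items the user actually interacted with.
    """
    relevant = set(relevant)
    if not relevant or k <= 0:
        return 0.0

    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a hit occurs.
            score += hits / rank

    # Normalization convention assumed for this sketch.
    return score / min(len(relevant), k)


def map_at_k(recs_by_user, relevant_by_user, k):
    """Mean of average_precision_at_k over users with relevant items."""
    users = [u for u in recs_by_user if relevant_by_user.get(u)]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recs_by_user[u], relevant_by_user[u], k)
        for u in users
    )
    return total / len(users)


# Toy data, invented for illustration only.
recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
held_out = {"u1": {"a", "c"}, "u2": {"y"}}
score = map_at_k(recs, held_out, k=3)
```

Because hits near the top of the list contribute more than hits further down, two recommenders with the same precision@k can still get different MAP@k scores.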

BTW, there is already an open-source recommender that does this; it allows mixing multiple events/actions/indicators and can also use content similarity. It is based on PredictionIO's framework, which is Spark-based.