python machine-learning recommendation-engine matrix-factorization

Evaluating the LightFM Recommendation Model

I've been playing around with lightfm for quite some time and found it really useful to generate recommendations. However, there are two main questions that I would like to know.

to evaluate the LightFM model in case where the rank of the recommendations matter, should I rely more on precision@k or other provided evaluation metrics such as AUC score? in what cases should I focus on improving my precision@k compared to other metrics? or maybe are they highly correlated? which means if I manage to improve my precision@k score, the other metrics would follow, am I correct?
how would you interpret if a model that trained using WARP loss function has a score 0.089 for precision@5 ? AFAIK, Precision at 5 tells me what proportion of the top 5 results are positives/relevant. which means I would get 0 precision@5 if my predictions could not make it to top 5 or I will get 0.2 if I got only one predictions correct in the top 5. But I cannot interpret what 0.0xx means for precision@n

Thanks

Solution

Precision@K and AUC measure different things, and give you different perspectives on the quality of your model. In general, they should be correlated, but understanding how they differ may help you choose the one that is more important for your application.

Precision@K measures the proportion of positive items among the K highest-ranked items. As such, it's very focused on the ranking quality at the top of the list: it doesn't matter how good or bad the rest of your ranking is as long as the first K items are mostly positive. This would be an appropriate metric if you are only ever going to be showing your users the very top of the list.
AUC measures the quality of the overall ranking. In the binary case, it can be interpreted as the probability that a randomly chosen positive item is ranked higher than a randomly chosen negative item. Consequently, an AUC close to 1.0 will suggest that, by and large, your ordering is correct: and this can be true even if none of the first K items are positives. This metric may be more appropriate if you do not exert full control on which results will be presented to the user; it may be that the first K recommended items are not available any more (say, they are out of stock), and you need to move further down the ranking. A high AUC score will then give you confidence that your ranking is of high quality throughout.

Note also that while the maximum value of the AUC metric is 1.0, the maximum achievable precision@K is dependent on your data. For example, if you measure precision@5 but there is only one positive item, the maximum score you can achieve is 0.2.

In LightFM, the AUC and precision@K routines return arrays of metric scores: one for every user in your test data. Most likely, you average these to get a mean AUC or mean precision@K score: if some of your users have score 0 on the precision@5 metric, it is possible that your average precision@5 will be between 0 and 0.2.

Hope this helps!