I am trying to build a recommendation system in R.
Data Set below:
https://drive.google.com/file/d/1FVh-Xg3NBtzKgZHnDTi7IjaATW_fPmW9/view?usp=sharing
beer_data <- read.csv("beer_data.csv", stringsAsFactors = FALSE)
library(recommenderlab)
r <- as(beer_data, "realRatingMatrix")
Now if I check the number of reviews in each object, the counts do not match:
nrow(beer_data) # 475984
length(getRatings(r)) # 474560
The range of the ratings does not match either:
> range(beer_data$review_overall)
[1] 0 5
> range(getRatings(r))
[1] 0 15
I have checked other data sets too, and the issue does not appear with them.
I got the answer:
Some users in the data have rated the same beer more than once (two, three or more times). When recommenderlab coerces the data into a realRatingMatrix, it sums the ratings from such duplicate rows into a single entry. That is why some ratings are greater than 5 and why length(getRatings(r)) is less than nrow(beer_data).
For example, take these sample rows from beer_data:
beer_beerid, review_profilename, review_overall
19667, 57md, 3.5
19667, 57md, 4.0
In the realRatingMatrix, the entry for user "57md" and item "19667" becomes 3.5 + 4.0 = 7.5, and the two rows collapse into one rating.
For the same reason, every non-unique combination of beer_beerid and review_profilename is merged into a single entry, which causes the mismatch in the number of ratings between the data frame and the realRatingMatrix.
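One way to avoid this is to remove or combine the duplicate rows before coercing. Below is a minimal base R sketch, assuming the three columns shown in the sample above. The toy values and the names key, is_dup, beer_first and beer_mean are made up for illustration, and whether to keep the first rating or the mean is a judgment call, not something the package prescribes.

```r
# Toy data with the same columns as the sample above (values are illustrative).
beer_data <- data.frame(
  beer_beerid        = c(19667, 19667, 19667, 436),
  review_profilename = c("57md", "57md", "other_user", "57md"),
  review_overall     = c(3.5, 4.0, 4.5, 3.0),
  stringsAsFactors   = FALSE
)

# Flag rows whose (beer, user) pair has already appeared earlier in the data.
key    <- beer_data[, c("beer_beerid", "review_profilename")]
is_dup <- duplicated(key)
sum(is_dup)  # number of rows that would be merged during coercion

# Option 1: keep only the first rating for each (beer, user) pair.
beer_first <- beer_data[!is_dup, ]

# Option 2: replace repeated ratings with their mean.
beer_mean <- aggregate(
  review_overall ~ beer_beerid + review_profilename,
  data = beer_data,
  FUN  = mean
)
```

On the real data, sum(is_dup) should account for the difference between nrow(beer_data) and length(getRatings(r)). Coercing either beer_first or beer_mean with as(..., "realRatingMatrix") should then give ratings that stay within the original 0 to 5 range.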