sqlmathweighted-averagerelevance

Count relevance/weighted arithmetic mean SQL


I have a table of movies and a table of reviews

movies reviews

In my app, I want to show top 10 movies of any genre.

I clearly cannot sort movies just by rating since there are movies with only 1 5 star review, thus only irrelevant movies will be recommended to users.

Currently I receive from DB top 100 movies in this genre sorted by reviews, sort this list by rating on server and only then display top 10.

That kinda works but this solution is impractical in case of e.g. review bombing and moreover, the purpose of top 10 list is to recommend the most relevant movies.

My idea was to add relevance column into the movies table but I've got no clue how to count it:

  1. (amount of 5 star reviews * 5 ) + (amount of 4 star reviews * 4 ) and so on - no
  2. (amount of 5 star reviews * 1 ) + (amount of 4 star reviews * 0.8) + ... + (amount of 0 star reviews * 0.1) - no
  3. total amount of reviews / avgrating - no
  4. ((amount of 5 star reviews * 5 ) + (amount of 4 star reviews * 4 ) and so on) / amount of reviews total - mb, I'm not sure what about 0

Moreover, the rating in ratings is not a real number. User can give only 5, 4.5, 4 etc. score review. But what about the situation where users can rate movies like 5, 4.9, 4.8 ... 0.1?

So, how to perform this operation in better way?

[Upd] I think instead of division of smth. we should multiply averagerating and reviews from movies in order to count the relevance (averagerating and reviews are alredy automatically updated on each insert/delete/update). Also we should try to normalize the product.

In this situation movies with 100 reviews of 5 and averagerating of 5 won't beat up movies with averagerating 3.8 but with 57k reviews and also the problem of review bombing will be solved.

Can anyone prove my guess?


Solution

  • I agree with @NickW that this is more of a statistics question than a programming question, but I'll try to answer it, anyway.

    If you want to account for both average rating and number of ratings, a straightforward method is to multiply the two. This gives you the sum of all ratings, but, as @qwezxc789 notes, this does not account for the number of zero ratings. Another strategy could be a linear combination of avgrating and reviews. Collinearity shouldn't be an issue because neither variable depends on the other. You could even play around with the linear coefficients to change the relative contribution of each variable. This solution easily generalizes to n independent variables.

    Let wi be the weight of predictor i, 1 ≤ in, w1 + ... + wn = 1 (or any other constant, but why not use 1?).

    You can add this value as a new relevance column in the movies table using the following SQL. I use two equally weighted predictors: ratings and reviews.

    ALTER TABLE movies ADD [relevance] AS
    (SELECT 0.5 * avgrating + 0.5 * reviews
    FROM movies)