Tags: mysql, sql, algorithm, popularity

SQL popularity algorithm with weighted score


I'm implementing an algorithm that returns the posts that are popular at the moment, based on their likes and dislikes.

To do this, for each post I add up all of its likes (1) and dislikes (-1) to get its score, but each like/dislike is weighted: the more recent, the heavier. For example, at the moment a user likes a post, that like weighs 1. After 1 day it weighs 0.95 (or -0.95 if it's a dislike), after 2 days 0.90, and so on, down to a minimum of 0.01 reached after 21 days. (PS: these are completely approximate values.)
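
In other words, the weight applied to each vote by the query below is roughly:

weight = 1 - MIN(age_in_days / 21, 0.99)

so it decays linearly with age and never drops below 0.01.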

Here is how my tables are structured:

Posts table

id | Title                 | user_id | ...
-------------------------------------------
1  | Random post           | 10      | ...
2  | Another post          | 36      | ...
n  | ...                   | n       | ...

Likes table

id | vote | post_id | user_id | created
----------------------------------------
1  | 1    | 2       | 10      | 2014-08-18 15:34:20
2  | -1   | 1       | 24      | 2014-08-15 18:54:12
3  | 1    | 2       | 54      | 2014-08-17 21:12:48 

Here is the SQL query I'm currently using, which does the job:

SELECT Post.*, `Like`.*, 
SUM(`Like`.vote * 
    (1 - IF((TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21 > 0.99, 0.99, (TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21))
   ) AS score 
FROM posts Post 
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id) 
GROUP BY Post.id
ORDER BY score DESC

PS: I'm using TIMESTAMPDIFF with MINUTE and not DAY directly because I'm computing the number of days myself; with DAY it returns an integer, and I want a float value so that the weight decays gradually over time instead of day by day. So TIMESTAMPDIFF(MINUTE, Like.created, NOW()) / 60 / 24 just gives me the number of days elapsed since the like was created, including the decimal part.
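For example, for a like created 36 hours ago, TIMESTAMPDIFF(MINUTE, Like.created, NOW()) returns 2160, and 2160 / 60 / 24 = 1.5 days.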

Here are my questions:

  1. Look at the IF(expr1, expr2, expr3) part: it is necessary in order to set a minimum value for the like's weight, so that it never goes below 0.01 or turns negative (that way even a very old like still carries a little weight). But I'm calculating the same thing twice: the expression in expr1 is repeated as expr3. Isn't there a way to avoid this duplicate expression? (One idea I had is sketched right after this list, but I'm not sure it's the right approach.)
  2. I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on big Post and Like tables. Is the cache really necessary or not? I'm aiming to run this query on a Post table with 50,000 entries, each with about 200 associated likes (which makes a 10,000,000-entry Like table).
  3. Should I create an index in the Like table on post_id? And on created?
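
Here is the kind of rewrite I had in mind for question 1, using MySQL's LEAST() to clamp the value instead of IF (I'm not sure whether this is the right approach):

SELECT Post.*, 
SUM(`Like`.vote * 
    (1 - LEAST(TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 / 21, 0.99))
   ) AS score 
FROM posts Post 
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id) 
GROUP BY Post.id
ORDER BY score DESC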

Thank you!

EDIT: Imagine a Post can have multiple Tags, and each Tag can belong to multiple Posts. If I want to get popular Posts for a given Tag or a set of Tags, I can't cache each query, since there is a large number of possible combinations. Is the query still viable in that case?
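
For illustration, assuming a hypothetical posts_tags(post_id, tag_id) join table (the table, alias and tag id below are made up), the single-tag version would look something like the following; filtering on several tags at once would need more care so that likes are not counted twice:

SELECT Post.*, 
SUM(`Like`.vote * 
    (1 - IF((TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21 > 0.99, 0.99, (TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21))
   ) AS score 
FROM posts Post 
INNER JOIN posts_tags PostTag ON (PostTag.post_id = Post.id) 
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id) 
WHERE PostTag.tag_id = 42
GROUP BY Post.id
ORDER BY score DESC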

EDIT FOR FINAL SOLUTION: I finally did some tests. I created a Post table with 30,000 entries and a Like table with 250,000 entries. Without indexes, the query took incredibly long (it timed out after more than 10 minutes), but with indexes on Post.id (primary), Like.id (primary) and Like.post_id it took ~0.5 s.

So I'm not caching the data, nor updating it every 5 minutes. If the tables keep growing, that remains a possible fallback (anything over 1 s would not be acceptable).
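
For reference, the extra index was created with something like the following (the primary keys already existed; the index name is arbitrary):

CREATE INDEX idx_likes_post_id ON likes (post_id);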


Solution

  • 2: I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on big Post and Like tables. Is the cache really necessary or not? I'm aiming to run this query on a Post table with 50,000 entries, each with about 200 associated likes (which makes a 10,000,000-entry Like table).

    Those table sizes (50,000 posts and 10,000,000 likes) are considered small on current hardware. With those sizes you probably won't need any cache, unless the query runs several times per second. In any case, I would do a performance test before deciding to add a cache.

  • 3: Should I create an index in the Like table on post_id? And on created?

    I would create an index on (post_id, created, vote). That way the query can get all the information it needs from the index and doesn't have to read the table at all.
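
    Something along these lines (the index name is just an example):

        CREATE INDEX idx_likes_post_created_vote ON likes (post_id, created, vote);

    If MySQL uses it as a covering index, EXPLAIN should show "Using index" for the likes table.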

    Edit (response to comments):

    An extra index will slow down inserts/updates slightly. In the end, the path you choose will dictate the characteristics of what you need in terms of CPU/RAM/Disk I/O. If you have enough RAM for the DB so that you expect the entire Like table to be cached in RAM then you might be better off with an index on just post_id.

    In terms of total load you need to consider the ratio between insert and select and the relative cost of insert and select with or without the index. My gut feeling is that the total load will be lower with the index.

    Regarding your question on concurrency (selecting and inserting simultaneously): what happens depends on the isolation level. The general advice is to keep inserts/updates as short as possible. If you don't do unnecessary things between the start of the insert and the commit, you should be fine.