I'm implement an algorithm that returns popular posts at the moment, given his likes and dislikes.
To do this, for each post I add all his likes (1) and dislikes (-1) to get his score but each like/dislike is weighted : the latest, the heaviest. For example, at the moment an user likes a post, his like weights 1. After 1 day, it weights 0.95 (or -0.95 if it's a dislike), after 2 days, 0.90, and so on... With a minimal of 0.01 reached after 21 days. (PS: Theses are totally approximate values)
Here are how my tables are made :
Posts table
id | Title | user_id | ...
-------------------------------------------
1 | Random post | 10 | ...
2 | Another post | 36 | ...
n | ... | n | ...
Likes table
id | vote | post_id | user_id | created
----------------------------------------
1 | 1 | 2 | 10 | 2014-08-18 15:34:20
2 | -1 | 1 | 24 | 2014-08-15 18:54:12
3 | 1 | 2 | 54 | 2014-08-17 21:12:48
Here is the SQL query I'm currently using which does the job
SELECT Post.*, Like.*,
SUM(Like.vote *
(1 - IF((TIMESTAMPDIFF(MINUTE, Like.created, NOW()) / 60 / 24) / 21 > 0.99, 0.99, (TIMESTAMPDIFF(MINUTE, Like.created, NOW()) / 60 / 24) / 21))
) AS score
FROM posts Post
LEFT JOIN likes Like ON (Post.id = Like.post_id)
GROUP BY Post.id
ORDER BY score DESC
PS: I'm using TIMESTAMPDIFF
with MINUTE
and not DAY
directly because I'm calculating the day myself otherwise it returns me an integrer and I want a float value, in order to gradually decay overtime and not day per day. So TIMESTAMPDIFF(MINUTE, Like.created, NOW())/60/24
just gives me the number of day passed since the like creation with the decimal part.
Here are my questions :
IF(expr1, expr2, expr3)
part : it is necessary in order to set minimal value for the like's weight, so it will not go under 0.01 and become negative (and so the like, even older still has a little weight). But I'm calculating 2 times the same thing : expr1 is the same as expr2. Isn't there a way to avoid this duplicate expression ?Post
and Like
table. Is the cache really necessary or not ? I'm aiming to run this query on a table with 50 000 entries, and for each 200 associated likes (that makes a 10 000 000 entries Like
table).Like
table for post_id ? And for created ?Thank you !
EDIT: Imagine a Post
can have multiple tags, and each tag can belong to multiple posts. If I want to get populars Posts given a Tag or multiple Tag, I can't cache each query ; as there is a good amount of possible queries. Is the query still viable so ?
EDIT FOR FINAL SOLUTION: I finally did some tests. I created a table Post with 30 000 entries and Like with 250 000 entries. Without index, the query was incredibly long (timed out > 10mn), but with indexes on Post.id (primary), Like.id(primary) and Like.post_id it took ~0.5s.
So I'm not caching the data, neither using update every 5mn. If the table keeps growing this is still possible solution (over 1s it's not acceptable).
2: I was going to cache this query and update it every 5 minutes, as I think it will be pretty heavy on a big Post and Like table. Is the cache really necessary or not ? I'm aiming to run this query on a table with 50 000 entries, and for each 200 associated likes (that makes a 10 000 000 entries Like table).
10000 and 50000 are considered small on current hardware. With those table sizes you probably won't need any cache, unless the query will run several times per second. Anyway, I would do a performance test before deciding to have a cache.
3: Should I create Index in Like table for post_id ? And for created ?
I would create an index for (post_id, created, vote). That way the query can get all information from the index and doesn't need to read the table at all.
Edit (response to comments):
An extra index will slow down inserts/updates slightly. In the end, the path you choose will dictate the characteristics of what you need in terms of CPU/RAM/Disk I/O.
If you have enough RAM for the DB so that you expect the entire Like
table to be cached in RAM then you might be better off with an index on just post_id
.
In terms of total load you need to consider the ratio between insert
and select
and the relative cost of insert and select with or without the index.
My gut feeling is that the total load will be lower with the index.
Regarding your question on concurrency (selecting and inserting simultaneously). What happens depends on the isolation level. The general advice is to keep inserts/updates as short as possible. If you don't do unneccessary things between the start of the insert
and the commit
you should be fine.