postgresqlgroup-bylateral

How do I do LIMIT within GROUP in the same table?


I can't figure out how to do limit within group although I've read all similar questions here. Reading PSQL doc didn't help either :( Consider the following:

CREATE TABLE article_relationship
(
    article_from INT NOT NULL,
    article_to INT NOT NULL,
    score INT
);

I want to get a list of top 5 related articles per given article IDs sorted by score.

Here is what I tried:

select DISTINCT o.article_from
from article_relationship o
join lateral (
       select i.article_from, i.article_to, i.score from article_relationship i
       order by score desc
       limit 5
       ) p on p.article_from = o.article_from
where o.article_from IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.article_from;

And it returns nothing. I was under impression that outer query is like loop so I guess I only need source IDs there.

Also what if I want to join on articles table where there are columns id and title and get titles of related articles in resultset?

I added join in inner query:

select o.id, p.*
from articles o
join lateral (
       select a.title, i.article_from, i.article_to, i.score
       from article_relationship i
       INNER JOIN articles a on a.id = i.article_to
       where i.article_from = o.id
       order by score desc
       limit 5
       ) p on true
where o.id IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.id;

But it made it very very slow.


Solution

  • The problem with no rows returning from your query is that your join condition is wrong: ON p.article_from = o.article_from; this should obviously be ON p.article_from = o.article_to.

    That issue aside, your query will not return the top 5 scoring relations per article id; instead it will return the article IDs that reference one of the 5 top rated referenced articles throughout the table and (also) at least 1 of the 5 referenced articles for which you specify the id.

    You can get the top 5 rated referenced articles per referencing article with a window function to rank the scores in a sub-select and then select only the top 5 in the main query. Specifying a list of referenced article IDs effectively means that you will rank how these referenced articles are scored for each referencing article:

    SELECT article_from, article_to, score
    FROM (
        SELECT article_from, article_to, score, 
               rank() OVER (PARTITION BY article_from ORDER BY score DESC) AS rnk
        FROM article_relationship
        WHERE article_to IN (18329382, 61913904, 66538293, 66540477, 66496909) ) a
    WHERE rnk < 6
    ORDER BY article_from, score DESC;
    

    This is different from your code in that it returns up to 5 records for each article_from but it is consistent with your initial description.

    Adding columns from table articles is trivially done in the main query:

    SELECT a.article_from, a.article_to, a.score, articles.*
    FROM (
        SELECT article_from, article_to, score, 
               rank() OVER (PARTITION BY article_from ORDER BY score DESC) AS rnk
        FROM article_relationship
        WHERE article_to IN (18329382, 61913904, 66538293, 66540477, 66496909) ) a
    JOIN articles ON articles.id = a.article_to
    WHERE a.rnk < 6
    ORDER BY a.article_from, a.score DESC;