Tags: sql, postgresql, np

SQL query to find rows with the most matching keywords


I'm really bad at SQL and would like to know what SQL I can run to solve the problem below, which I suspect to be NP-complete. I'm OK with the query taking a long time to run over large datasets, as this will be done as a background task. A standard SQL statement is preferred, but if a stored procedure is required then so be it. The SQL needs to run on Postgres 9.3.

Problem: Given a set of articles, each containing a set of keywords, find for each article the top n other articles with the most matching keywords.

A trimmed down version of the article table looks like this:

CREATE TABLE article (
  id character varying(36) NOT NULL,  -- primary key of article
  keywords character varying,         -- comma separated set of keywords

  CONSTRAINT pk_article PRIMARY KEY (id)
);

-- Test Data
INSERT INTO article(id, keywords) VALUES('0', 'red,green,blue');
INSERT INTO article(id, keywords) VALUES('1', 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES('2', 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES('3', 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES('4', 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES('5', 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES('6', 'black,white,blue');

A SELECT * FROM article; query then returns:

Table: article
------------------------
id  keywords            
------------------------
0   red,green,blue      
1   red,green,yellow    
2   purple,orange,blue  
3   lime,violet,ruby,teal
4   red,green,blue,yellow
5   yellow,brown,black
6   black,white,blue

Assuming I want to find, for each article, the top 3 articles with the most matching keywords, the output should be this:

------------------------
id  related
------------------------
0   4,1,6
1   4,0,5
2   0,4,6
3   null
4   0,1,6
5   1,6
6   5,0,4

Solution

  • Like @a_horse commented: this would be simpler with a normalized design (besides making other tasks simpler and cleaner), but it's still not trivial.

    Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.
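
    For illustration, a hypothetical revised definition, if the schema can change (serial is one option on Postgres 9.3; a uuid column via the uuid-ossp extension is another):

    CREATE TABLE article (
       id       serial PRIMARY KEY,   -- or: uuid
       keywords character varying     -- comma-separated keywords, as before
    );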

    Here is one possible solution based on your design as is:

    WITH cte AS (  -- convert the comma-separated string to an array once
       SELECT id, string_to_array(a.keywords, ',') AS keys
       FROM   article a
       )
    SELECT id, string_agg(b_id, ',') AS best_matches
    FROM  (
       SELECT a.id, b.id AS b_id
            , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
       FROM   cte a
       LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys  -- only pairs with overlapping keys
       LEFT   JOIN LATERAL (
          SELECT count(*) AS ct  -- number of shared keywords
          FROM  (
             SELECT * FROM unnest(a.keys)
             INTERSECT ALL
             SELECT * FROM unnest(b.keys)
             ) i
          ) ct ON TRUE
       ORDER  BY a.id, ct.ct DESC, b.id  -- b.id as tiebreaker
       ) sub
    WHERE  rn < 4  -- keep the top 3 per article
    GROUP  BY 1;
    

    sqlfiddle (using an integer id instead).

    The CTE cte converts the comma-separated string into an array. You could even have a functional GIN index on that expression, as sketched below.
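
    A minimal sketch (the index name is my choice; string_to_array() with a constant delimiter is IMMUTABLE, so it is allowed in an index expression):

    CREATE INDEX article_keywords_gin_idx ON article
    USING gin (string_to_array(keywords, ','));

    Note that to actually use this index, a query has to reference the indexed expression string_to_array(keywords, ',') directly; the CTE above is materialized in Postgres 9.3 and acts as an optimization fence.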

    If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.
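
    Side note: the final string_agg() relies on the row order produced by the subquery's ORDER BY, which works in practice but isn't strictly guaranteed. If you want the order of ids in the result pinned down explicitly, you can sort inside the aggregate instead; a minor variant of the same query (my adjustment, not required by the original):

    WITH cte AS (
       SELECT id, string_to_array(keywords, ',') AS keys
       FROM   article
       )
    SELECT id, string_agg(b_id, ',' ORDER BY rn) AS best_matches  -- order inside the aggregate
    FROM  (
       SELECT a.id, b.id AS b_id
            , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
       FROM   cte a
       LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys
       LEFT   JOIN LATERAL (
          SELECT count(*) AS ct
          FROM  (
             SELECT * FROM unnest(a.keys)
             INTERSECT ALL
             SELECT * FROM unnest(b.keys)
             ) i
          ) ct ON TRUE
       ) sub
    WHERE  rn < 4
    GROUP  BY 1;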

    Detailed explanation in this recent related answer:

    The comparison there is between a JSON array and an SQL array, but it's basically the same problem and boils down to the same solution(s). That answer also compares a couple of similar alternatives.

    To make this fast, you should at least have a GIN index on an actual array column (instead of the comma-separated string); then the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array column with a GIN index.
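
    A minimal sketch of that migration, assuming the comma-separated column can simply be replaced (column and index names are my choice):

    ALTER TABLE article ADD COLUMN keys text[];
    UPDATE article SET keys = string_to_array(keywords, ',');
    ALTER TABLE article DROP COLUMN keywords;

    CREATE INDEX article_keys_gin_idx ON article USING gin (keys);

    With that in place, the query joins article to itself directly on a.keys && b.keys and the cte step disappears.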