Tags: sql, postgresql, np

SQL query to find rows with the most matching keywords


I'm really bad at SQL and would like to know what SQL I can run to solve the problem below, which I suspect to be NP-complete. I'm OK with the query taking a long time to run over large datasets, as this will be done as a background task. A standard SQL statement is preferred, but if a stored procedure is required then so be it. The SQL needs to run on Postgres 9.3.

Problem: Given a set of articles, each containing a set of keywords, find for each article the top n other articles with the most matching keywords.

A trimmed down version of the article table looks like this:

CREATE TABLE article (
  id character varying(36) NOT NULL,  -- primary key of article
  keywords character varying,         -- comma separated set of keywords

  CONSTRAINT pk_article PRIMARY KEY (id)
);

-- Test Data
INSERT INTO article(id, keywords) VALUES('0', 'red,green,blue');
INSERT INTO article(id, keywords) VALUES('1', 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES('2', 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES('3', 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES('4', 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES('5', 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES('6', 'black,white,blue');

A SELECT * FROM article; query then returns:

Table: article
------------------------
id  keywords            
------------------------
0   red,green,blue      
1   red,green,yellow    
2   purple,orange,blue  
3   lime,violet,ruby,teal
4   red,green,blue,yellow
5   yellow,brown,black
6   black,white,blue

Assuming I want to find, for each article, the top 3 articles with the most matching keywords, the output should be this:

------------------------
id  related
------------------------
0   4,1,6
1   4,0,5
2   0,4,6
3   null
4   0,1,6
5   1,6
6   5,0,4

Solution

  • Like @a_horse commented: this would be simpler with a normalized design (besides making other tasks simpler and cleaner), but it's still not trivial.

    Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.
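
    For illustration, a hypothetical revised definition, if the schema can change (serial is one option on Postgres 9.3; a uuid column via the uuid-ossp extension is another):

    CREATE TABLE article (
       id       serial PRIMARY KEY,   -- or: uuid
       keywords character varying     -- comma-separated keywords, as before
    );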

    Here is one possible solution based on your design as is:

    WITH cte AS (  -- convert the comma-separated string to an array once
       SELECT id, string_to_array(a.keywords, ',') AS keys
       FROM   article a
       )
    SELECT id, string_agg(b_id, ',') AS best_matches
    FROM  (
       SELECT a.id, b.id AS b_id
            , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
       FROM   cte a
       LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys  -- only pairs with overlapping keys
       LEFT   JOIN LATERAL (
          SELECT count(*) AS ct  -- number of shared keywords
          FROM  (
             SELECT * FROM unnest(a.keys)
             INTERSECT ALL
             SELECT * FROM unnest(b.keys)
             ) i
          ) ct ON TRUE
       ORDER  BY a.id, ct.ct DESC, b.id  -- b.id as tiebreaker
       ) sub
    WHERE  rn < 4  -- keep the top 3 per article
    GROUP  BY 1;
    

    sqlfiddle (using an integer id instead).

    The CTE cte converts the comma-separated string into an array. You could even have a functional GIN index on that expression, as sketched below.
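
    A minimal sketch (the index name is my choice; string_to_array() with a constant delimiter is IMMUTABLE, so it is allowed in an index expression):

    CREATE INDEX article_keywords_gin_idx ON article
    USING gin (string_to_array(keywords, ','));

    Note that to actually use this index, a query has to reference the indexed expression string_to_array(keywords, ',') directly; the CTE above is materialized in Postgres 9.3 and acts as an optimization fence.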

    If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.
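
    Side note: the final string_agg() relies on the row order produced by the subquery's ORDER BY, which works in practice but isn't strictly guaranteed. If you want the order of ids in the result pinned down explicitly, you can sort inside the aggregate instead; a minor variant of the same query (my adjustment, not required by the original):

    WITH cte AS (
       SELECT id, string_to_array(keywords, ',') AS keys
       FROM   article
       )
    SELECT id, string_agg(b_id, ',' ORDER BY rn) AS best_matches  -- order inside the aggregate
    FROM  (
       SELECT a.id, b.id AS b_id
            , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
       FROM   cte a
       LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys
       LEFT   JOIN LATERAL (
          SELECT count(*) AS ct
          FROM  (
             SELECT * FROM unnest(a.keys)
             INTERSECT ALL
             SELECT * FROM unnest(b.keys)
             ) i
          ) ct ON TRUE
       ) sub
    WHERE  rn < 4
    GROUP  BY 1;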

    Detailed explanation in this recent related answer:

    The comparison there is between a JSON array and an SQL array, but it's basically the same problem and boils down to the same solution(s). That answer also compares a couple of similar alternatives.

    To make this fast, you should at least have a GIN index on an actual array column (instead of the comma-separated string); then the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array column with a GIN index.
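
    A minimal sketch of that migration, assuming the comma-separated column can simply be replaced (column and index names are my choice):

    ALTER TABLE article ADD COLUMN keys text[];
    UPDATE article SET keys = string_to_array(keywords, ',');
    ALTER TABLE article DROP COLUMN keywords;

    CREATE INDEX article_keys_gin_idx ON article USING gin (keys);

    With that in place, the query joins article to itself directly on a.keys && b.keys and the cte step disappears.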