sqlmysqlruby-on-railstagstagging

Sql query to find things tagged with all specified tags


Let's say I have the following tables:

TAGS

id: integer
name: string

POSTS

id: integer
body: text

TAGGINGS

id: integer
tag_id: integer
post_id: integer

How would I go about writing a query that select all posts that are tagged with ALL of the following tags (name attribute of tags table): "Cheese", "Wine", "Paris", "Frace", "City", "Scenic", "Art"

See also: Sql query to find things with most specified tags (note: similar, but not a duplicate!)


Solution

  • Using IN:

    SELECT p.*
      FROM POSTS p
     WHERE p.id IN (SELECT tg.post_id
                      FROM TAGGINGS tg
                      JOIN TAGS t ON t.id = tg.tag_id
                     WHERE t.name IN ('Cheese','Wine','Paris','Frace','City','Scenic','Art')
                  GROUP BY tg.post_id
                    HAVING COUNT(DISTINCT t.name) = 7)
    

    Using a JOIN

    SELECT p.*
      FROM POSTS p
      JOIN (SELECT tg.post_id
              FROM TAGGINGS tg
              JOIN TAGS t ON t.id = tg.tag_id
             WHERE t.name IN ('Cheese','Wine','Paris','Frace','City','Scenic','Art')
          GROUP BY tg.post_id
            HAVING COUNT(DISTINCT t.name) = 7) x ON x.post_id = p.id
    

    Using EXISTS

    SELECT p.*
      FROM POSTS p
     WHERE EXISTS (SELECT NULL
                     FROM TAGGINGS tg
                     JOIN TAGS t ON t.id = tg.tag_id
                    WHERE t.name IN ('Cheese','Wine','Paris','Frace','City','Scenic','Art')
                      AND tg.post_id = p.id
                 GROUP BY tg.post_id
                   HAVING COUNT(DISTINCT t.name) = 7)
    

    Explanation

    The crux of things is that the COUNT(DISTINCT t.name) needs to match the number of tag names to ensure that all those tags are related to the post. Without the DISTINCT, there's a risk that duplicates of one of the names could return a count of 7--so you'd have a false positive.

    Performance

    Most will tell you the JOIN is optimal, but JOINs also risk duplicating rows in the resultset. EXISTS would be my next choice--no duplicate risk, and generally faster execution but checking the explain plan will ultimately tell you what's best based on your setup and data.