sqlmysqlrelational-division

Find supersets of association database table


I need to find supersets within an association table.

You can consider bag to be a collection of item records. The relationship between the two is tracked in bag_item.

I need to know which bags are supersets of other bags: containing ALL the items and possibly one or more additional items.

bag

id name
1 Bag A
2 Bag B
3 Bag C
4 Bag D

item

id name
1 Pencil
2 Pen
3 Eraser
4 Marker

bag_item (association table)

id bag_id item_id
1 1 1
2 1 2
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 3 4
9 4 1
10 4 2
11 4 3

Database: MySQL 8

I have "base" bag identifiers and I want to get "superset" bag identifiers that contain ALL the items in my "base" bag (and more items if present).

If I query for ALL bags, the response should look something like this:

base_bag_id superset_bag_id
1 2
1 4
3 2
4 2

But, I also want to be able to filter on the basis of the base_bag_id, e.g. 1 & 4.

I've tried something to the effect of the following:

select distinct base_bag_item.bag_id, superset_bag_item.bag_id
from bag_item as base_bag_item
join bag_item as superset_bag_item on superset_bag_item.item_id = base_bag_item.item_id
where superset_bag_item.bag_id != base_bag_item.bag_id
and base_bag_item.bag_id in (1, 4)
order by base_bag_item.bag_id, superset_bag_item.bag_id

However, this query will incorrectly produce bags that contain at least one of the quotes from the "base", e.g. base_bag_id 1 will have superset_bag_id 3 because both bags contain item_id 1.

How do I get my desired output?


Solution

  • This is classic Relational Division With Remainder, with the multiple divisors tweak (you want to get all bags for all other possible bags, not just for one bag).

    There are a number of solutions.


    A simple one, if rather inefficient, is a double NOT EXISTS.

    SELECT
      base.id AS base_bag_id,
      s.id AS superset_bag_id
    FROM bag base
    JOIN bag s
       ON s.id <> base.id
      AND NOT EXiSTS (SELECT 1
        FROM bag_item bi
        WHERE bi.bag_id = base.id
          AND NOT EXISTS (SELECT 1
            FROM bag_item si
            WHERE si.item_id = bi.item_id
              AND si.bag_id = s.id
        )
    );
    

    A more efficient solution is to join everything up using a left-join, then group and check the count for missing values.

    SELECT
      base.bag_id AS base_bag_id,
      s.id AS superset_bag_id
    FROM bag_item base
    JOIN bag s ON s.id <> base.bag_id
    LEFT JOIN bag_item si
       ON si.bag_id = s.id
      AND si.item_id = base.item_id
    GROUP BY
      base.bag_id,
      s.id
    HAVING COUNT(si.item_id) = COUNT(*);
    

    However, performance is still poor with a large number of bags and items. This can be significantly improved by limiting the superset bag candidates to those that contain at least one of the items in the base bag.

    SELECT
      base.bag_id AS base_bag_id,
      s.id AS superset_bag_id
    FROM bag_item base
    JOIN bag s
      ON s.id <> base.bag_id
      AND s.id IN (
        SELECT DISTINCT overlap_bag_item.bag_id
        FROM bag_item AS base_bag_item_mirror
        JOIN bag_item AS overlap_bag_item
          ON overlap_bag_item.item_id = base_bag_item_mirror.item_id
        WHERE base_bag_item_mirror.bag_id = base.bag_id)
    LEFT JOIN bag_item si
      ON si.bag_id = s.id
      AND si.item_id = base.item_id
    GROUP BY
      base.bag_id,
      s.id
    HAVING COUNT(si.item_id) = COUNT(*);
    

    This can then be modified to filter out the base bags that we're interested in, e.g. base bag id of 1 or 4.

    SELECT
      base.bag_id AS base_bag_id,
      s.id AS superset_bag_id
    FROM bag_item base
    JOIN bag s
      ON s.id <> base.bag_id
      AND s.id IN (
        SELECT /*+ NO_SEMIJOIN(MATERIALIZATION) */ DISTINCT overlap_bag_item.bag_id
        FROM bag_item AS base_bag_item_mirror
        JOIN bag_item AS overlap_bag_item
          ON overlap_bag_item.item_id = base_bag_item_mirror.item_id
        WHERE base_bag_item_mirror.bag_id = base.bag_id)
    LEFT JOIN bag_item si
      ON si.bag_id = s.id
      AND si.item_id = base.item_id
    WHERE base.bag_id IN (1, 4)
    GROUP BY
      base.bag_id,
      s.id
    HAVING COUNT(si.item_id) = COUNT(*);
    

    This solution selectively turns off semijoin optimization in MySQL for the superset filtering subquery. When semijoin optimization is enabled, adding the WHERE base.bag_id IN (1, 4) statement results in the creation of a temporary table with all the bag_items in it. This defeats the purpose of the subquery and tanks performance. We avoid this with an optimizer hint.

    If you switch to WHERE (base.bag_id = 1 OR base.bag_id = 4), semijoin optimization isn't performed on the superset subquery and the hint isn't needed.

    When making these kinds of optimization decisions, always check EXPLAIN first.


    If necessary, we can also remove the reference to bag by using a window function over the first bag_item.

    SELECT
      base.bag_id AS base_bag_id,
      si.bag_id AS superset_bag_id
    FROM (
        SELECT *,
          COUNT(*) OVER (PARTITION BY base.bag_id) AS count
        FROM bag_item base
    ) base
    LEFT JOIN bag_item si
       ON si.item_id = base.item_id
      AND si.bag_id <> base.bag_id
    GROUP BY
      base.bag_id,
      si.bag_id
    HAVING COUNT(*) = MIN(base.count);
    

    db<>fiddle