I need to find supersets within an association table.
You can consider bag
to be a collection of item
records. The relationship between the two is tracked in bag_item
.
I need to know which bag
s are supersets of other bag
s: containing ALL the item
s and possibly one or more additional items.
bag
id | name |
---|---|
1 | Bag A |
2 | Bag B |
3 | Bag C |
4 | Bag D |
item
id | name |
---|---|
1 | Pencil |
2 | Pen |
3 | Eraser |
4 | Marker |
bag_item
(association table)
id | bag_id | item_id |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 2 | 1 |
4 | 2 | 2 |
5 | 2 | 3 |
6 | 2 | 4 |
7 | 3 | 1 |
8 | 3 | 4 |
9 | 4 | 1 |
10 | 4 | 2 |
11 | 4 | 3 |
Database: MySQL 8
I have "base" bag
identifiers and I want to get "superset" bag
identifiers that contain ALL the items in my "base" bag (and more items if present).
If I query for ALL bags, the response should look something like this:
base_bag_id | superset_bag_id |
---|---|
1 | 2 |
1 | 4 |
3 | 2 |
4 | 2 |
But, I also want to be able to filter on the basis of the base_bag_id
, e.g. 1 & 4.
I've tried something to the effect of the following:
select distinct base_bag_item.bag_id, superset_bag_item.bag_id
from bag_item as base_bag_item
join bag_item as superset_bag_item on superset_bag_item.item_id = base_bag_item.item_id
where superset_bag_item.bag_id != base_bag_item.bag_id
and base_bag_item.bag_id in (1, 4)
order by base_bag_item.bag_id, superset_bag_item.bag_id
However, this query will incorrectly produce bags that contain at least one of the quotes from the "base", e.g. base_bag_id
1 will have superset_bag_id
3 because both bags contain item_id
1.
How do I get my desired output?
This is classic Relational Division With Remainder, with the multiple divisors tweak (you want to get all bags for all other possible bags, not just for one bag).
There are a number of solutions.
A simple one, if rather inefficient, is a double NOT EXISTS
.
SELECT
base.id AS base_bag_id,
s.id AS superset_bag_id
FROM bag base
JOIN bag s
ON s.id <> base.id
AND NOT EXiSTS (SELECT 1
FROM bag_item bi
WHERE bi.bag_id = base.id
AND NOT EXISTS (SELECT 1
FROM bag_item si
WHERE si.item_id = bi.item_id
AND si.bag_id = s.id
)
);
A more efficient solution is to join everything up using a left-join, then group and check the count for missing values.
SELECT
base.bag_id AS base_bag_id,
s.id AS superset_bag_id
FROM bag_item base
JOIN bag s ON s.id <> base.bag_id
LEFT JOIN bag_item si
ON si.bag_id = s.id
AND si.item_id = base.item_id
GROUP BY
base.bag_id,
s.id
HAVING COUNT(si.item_id) = COUNT(*);
However, performance is still poor with a large number of bags and items. This can be significantly improved by limiting the superset bag candidates to those that contain at least one of the items in the base bag.
SELECT
base.bag_id AS base_bag_id,
s.id AS superset_bag_id
FROM bag_item base
JOIN bag s
ON s.id <> base.bag_id
AND s.id IN (
SELECT DISTINCT overlap_bag_item.bag_id
FROM bag_item AS base_bag_item_mirror
JOIN bag_item AS overlap_bag_item
ON overlap_bag_item.item_id = base_bag_item_mirror.item_id
WHERE base_bag_item_mirror.bag_id = base.bag_id)
LEFT JOIN bag_item si
ON si.bag_id = s.id
AND si.item_id = base.item_id
GROUP BY
base.bag_id,
s.id
HAVING COUNT(si.item_id) = COUNT(*);
This can then be modified to filter out the base bags that we're interested in, e.g. base bag id
of 1 or 4.
SELECT
base.bag_id AS base_bag_id,
s.id AS superset_bag_id
FROM bag_item base
JOIN bag s
ON s.id <> base.bag_id
AND s.id IN (
SELECT /*+ NO_SEMIJOIN(MATERIALIZATION) */ DISTINCT overlap_bag_item.bag_id
FROM bag_item AS base_bag_item_mirror
JOIN bag_item AS overlap_bag_item
ON overlap_bag_item.item_id = base_bag_item_mirror.item_id
WHERE base_bag_item_mirror.bag_id = base.bag_id)
LEFT JOIN bag_item si
ON si.bag_id = s.id
AND si.item_id = base.item_id
WHERE base.bag_id IN (1, 4)
GROUP BY
base.bag_id,
s.id
HAVING COUNT(si.item_id) = COUNT(*);
This solution selectively turns off semijoin optimization in MySQL for the superset filtering subquery. When semijoin optimization is enabled, adding the WHERE base.bag_id IN (1, 4)
statement results in the creation of a temporary table with all the bag_item
s in it. This defeats the purpose of the subquery and tanks performance. We avoid this with an optimizer hint.
If you switch to WHERE (base.bag_id = 1 OR base.bag_id = 4)
, semijoin optimization isn't performed on the superset subquery and the hint isn't needed.
When making these kinds of optimization decisions, always check EXPLAIN
first.
If necessary, we can also remove the reference to bag
by using a window function over the first bag_item
.
SELECT
base.bag_id AS base_bag_id,
si.bag_id AS superset_bag_id
FROM (
SELECT *,
COUNT(*) OVER (PARTITION BY base.bag_id) AS count
FROM bag_item base
) base
LEFT JOIN bag_item si
ON si.item_id = base.item_id
AND si.bag_id <> base.bag_id
GROUP BY
base.bag_id,
si.bag_id
HAVING COUNT(*) = MIN(base.count);