sql postgresql relational-division sql-match-all

Query to match foreign key relationships

I have two tables in a this is Postgres database representing simple orders from a market. A master table with information about the order, and a detail table with containing specifics of the purchase, with a foreign key back to master. Easy enough.

Over thousands of orders from the market, I'd like to find some specific orders based on what was purchased and in what quantity.

I have two more tables, in similar fashion, a master and a child where I create a "pack" and detail items from the market.

For example: Pack A contains 2 Apples and 3 Oranges. I define that in the tables. Now I'd like to find how many orders, and which orders from the market match that specific combination exactly.

It's important that it is an exact match. An order containing additional products or with any different quantity does not match.

In the SQL Fiddle, I've setup the simple example with data. The raw DDL is below. Two of the orders in the table should match Pack A.

http://sqlfiddle.com/#!17/b4f55

CREATE TABLE customer_order(
 order_id serial PRIMARY KEY NOT NULL,
 customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE order_detail(
    id serial PRIMARY KEY,
    order_id INTEGER,
    item_sku VARCHAR(50),
    item_quantity INTEGER,
    FOREIGN KEY(order_id) REFERENCES customer_order(order_id)
);

INSERT INTO customer_order (customer_name) VALUES ('John');
INSERT INTO customer_order (customer_name) VALUES ('Mary');
INSERT INTO customer_order (customer_name) VALUES ('Bill');

INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (1, 'APPLE', 2);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (1, 'ORANGE', 3);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (2, 'ORANGE', 5);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (3, 'APPLE', 2);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (3, 'ORANGE', 3);

CREATE TABLE pack_master(
 pack_id serial PRIMARY KEY NOT NULL,
 name VARCHAR(100) NOT NULL
);

CREATE TABLE pack_child(
    id serial PRIMARY KEY,
    pack_id INTEGER,
    item_sku VARCHAR(50),
    item_quantity INTEGER,
    FOREIGN KEY(pack_id) REFERENCES pack_master(pack_id)
);

INSERT INTO pack_master (name) VALUES ('Pack A');
INSERT INTO pack_master (name) VALUES ('Pack B');

INSERT INTO pack_child (pack_id, item_sku, item_quantity) VALUES (1, 'APPLE', 2);
INSERT INTO pack_child (pack_id, item_sku, item_quantity) VALUES (1, 'ORANGE', 3);
INSERT INTO pack_child (pack_id, item_sku, item_quantity) VALUES (2, 'GRAPES', 5);

Solution

Assuming that pack_child (pack_id, item_sku), as well as order_detail (order_id, item_sku) are defined UNIQUE, this would work:

SELECT pc.pack_id, od.order_id
FROM   pack_child pc
LEFT   JOIN order_detail od USING (item_sku, item_quantity)
GROUP  BY 1, 2
HAVING count(*) = count(od.id)  -- every item of the pack has a match
AND    NOT EXISTS (
   SELECT
   FROM   order_detail od1
   LEFT   JOIN pack_child pc1 ON pc1.item_sku = od1.item_sku
                             AND pc1.item_quantity = od1.item_quantity
                             AND pc1.pack_id = pc.pack_id
   WHERE  od1.order_id = od.order_id
   AND    pc1.id IS NULL       -- and order has no additional item
   );

Returns all pairs of pack_id and order_id that match exactly.

db<>fiddle here

There are a hundred-and-one alternative ways to write the query. Which is the fastest depends on cardinalities, data distribution, constraints and, most importantly, available indexes.

It's a special application of relational-division. Here is an arsenal of techniques:

How to filter SQL results in a has-many-through relation

One alternative, probably faster: create views or materialized views of the parent tables including the item count:

CREATE MATERIALIZED VIEW v_pack_master AS
SELECT *
FROM   pack_master
JOIN  (
   SELECT pack_id, count(*) AS items
   FROM   pack_child
   GROUP  BY 1
   ) c USING (pack_id);

CREATE MATERIALIZED VIEW v_customer_order AS
SELECT *
FROM   customer_order
JOIN  (
   SELECT order_id, count(*) AS items
   FROM   order_detail
   GROUP  BY 1
   ) c USING (order_id);

(Orders typically don't change later, so might be viable candidates for a materialized view.)

Only if there can be many order items, an index might pay (index expressions in this order):

CREATE INDEX foo ON v_customer_order (items, order_id);

The query now only considers orders with a matching item count to begin with:

SELECT * -- pack_id, order_id
FROM   v_pack_master pm
LEFT   JOIN v_customer_order co USING (items)
JOIN   LATERAL (
   SELECT count(*) AS items
   FROM   pack_child pc
   JOIN   order_detail od USING (item_sku, item_quantity)
   WHERE  pc.pack_id  = pm.pack_id
   AND    od.order_id = co.order_id
   ) x USING (items);

.. then, if all items match, we don't have to rule out additional items any more. And we have all columns from the parent table at our disposal right away, to return whatever it is you want to return ...