sqlpostgresqlrelational-divisionsql-match-all

Query to match foreign key relationships


I have two tables in a this is Postgres database representing simple orders from a market. A master table with information about the order, and a detail table with containing specifics of the purchase, with a foreign key back to master. Easy enough.

Over thousands of orders from the market, I'd like to find some specific orders based on what was purchased and in what quantity.

I have two more tables, in similar fashion, a master and a child where I create a "pack" and detail items from the market.

For example: Pack A contains 2 Apples and 3 Oranges. I define that in the tables. Now I'd like to find how many orders, and which orders from the market match that specific combination exactly.

It's important that it is an exact match. An order containing additional products or with any different quantity does not match.

In the SQL Fiddle, I've setup the simple example with data. The raw DDL is below. Two of the orders in the table should match Pack A.

http://sqlfiddle.com/#!17/b4f55

CREATE TABLE customer_order(
 order_id serial PRIMARY KEY NOT NULL,
 customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE order_detail(
    id serial PRIMARY KEY,
    order_id INTEGER,
    item_sku VARCHAR(50),
    item_quantity INTEGER,
    FOREIGN KEY(order_id) REFERENCES customer_order(order_id)
);

INSERT INTO customer_order (customer_name) VALUES ('John');
INSERT INTO customer_order (customer_name) VALUES ('Mary');
INSERT INTO customer_order (customer_name) VALUES ('Bill');

INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (1, 'APPLE', 2);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (1, 'ORANGE', 3);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (2, 'ORANGE', 5);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (3, 'APPLE', 2);
INSERT INTO order_detail (order_id, item_sku, item_quantity) VALUES (3, 'ORANGE', 3);

CREATE TABLE pack_master(
 pack_id serial PRIMARY KEY NOT NULL,
 name VARCHAR(100) NOT NULL
);

CREATE TABLE pack_child(
    id serial PRIMARY KEY,
    pack_id INTEGER,
    item_sku VARCHAR(50),
    item_quantity INTEGER,
    FOREIGN KEY(pack_id) REFERENCES pack_master(pack_id)
);

INSERT INTO pack_master (name) VALUES ('Pack A');
INSERT INTO pack_master (name) VALUES ('Pack B');

INSERT INTO pack_child (pack_id, item_sku, item_quantity) VALUES (1, 'APPLE', 2);
INSERT INTO pack_child (pack_id, item_sku, item_quantity) VALUES (1, 'ORANGE', 3);
INSERT INTO pack_child (pack_id, item_sku, item_quantity) VALUES (2, 'GRAPES', 5);

Solution

  • Assuming that pack_child (pack_id, item_sku), as well as order_detail (order_id, item_sku) are defined UNIQUE, this would work:

    SELECT pc.pack_id, od.order_id
    FROM   pack_child pc
    LEFT   JOIN order_detail od USING (item_sku, item_quantity)
    GROUP  BY 1, 2
    HAVING count(*) = count(od.id)  -- every item of the pack has a match
    AND    NOT EXISTS (
       SELECT
       FROM   order_detail od1
       LEFT   JOIN pack_child pc1 ON pc1.item_sku = od1.item_sku
                                 AND pc1.item_quantity = od1.item_quantity
                                 AND pc1.pack_id = pc.pack_id
       WHERE  od1.order_id = od.order_id
       AND    pc1.id IS NULL       -- and order has no additional item
       );
    

    Returns all pairs of pack_id and order_id that match exactly.

    db<>fiddle here

    There are a hundred-and-one alternative ways to write the query. Which is the fastest depends on cardinalities, data distribution, constraints and, most importantly, available indexes.

    It's a special application of . Here is an arsenal of techniques:

    One alternative, probably faster: create views or materialized views of the parent tables including the item count:

    CREATE MATERIALIZED VIEW v_pack_master AS
    SELECT *
    FROM   pack_master
    JOIN  (
       SELECT pack_id, count(*) AS items
       FROM   pack_child
       GROUP  BY 1
       ) c USING (pack_id);
    
    CREATE MATERIALIZED VIEW v_customer_order AS
    SELECT *
    FROM   customer_order
    JOIN  (
       SELECT order_id, count(*) AS items
       FROM   order_detail
       GROUP  BY 1
       ) c USING (order_id);
    

    (Orders typically don't change later, so might be viable candidates for a materialized view.)

    Only if there can be many order items, an index might pay (index expressions in this order):

    CREATE INDEX foo ON v_customer_order (items, order_id);
    

    The query now only considers orders with a matching item count to begin with:

    SELECT * -- pack_id, order_id
    FROM   v_pack_master pm
    LEFT   JOIN v_customer_order co USING (items)
    JOIN   LATERAL (
       SELECT count(*) AS items
       FROM   pack_child pc
       JOIN   order_detail od USING (item_sku, item_quantity)
       WHERE  pc.pack_id  = pm.pack_id
       AND    od.order_id = co.order_id
       ) x USING (items);
    

    .. then, if all items match, we don't have to rule out additional items any more. And we have all columns from the parent table at our disposal right away, to return whatever it is you want to return ...