sqlrelational-algebra

Find the sids of the suppliers who supply every part


Suppliers(sid, sname, address)
Parts(pid, pname, colour)
Catalog(sid, pid, cost)

The answer to 'find all the suppliers who supply every part' is:

SELECT  C.sid
  FROM  Catalog C
  WHERE NOT EXISTS (
     SELECT  P.pid
       FROM  Parts P
       WHERE NOT EXISTS (
         SELECT  C1.sid
           FROM  Catalog C1
           WHERE C1.sid = C.sid
             AND C1.pid = P.pid
         )
     )

What is an explanation for this answer?

I've heard it explained as 'Find suppliers such that there does not exist a part that they do not sell', but I'm struggling to see how the following accomplishes that.

         SELECT  C1.sid
           FROM  Catalog C1
           WHERE C1.sid = C.sid
             AND C1.pid = P.pid

If Catalog is

James | Hammer
James | Anvil
James | Wrench
Henry | Hammer
Leroy | Anvil

and Part is

Hammer
Anvil
Wrench

then after the innermost clause what is returned?

Is it

James | Hammer
James | Anvil
James | Wrench
Henry | Hammer
Leroy | Anvil

and then does NOT EXISTS make it the following?

James | --
Henry | Anvil, Wrench
Leroy | Hammer, Wrench

How does using the Parts table then subtract these values?


Solution

  • This is a double nested NOT EXISTS query (no really, that's what I've usually seen it called), and it is used specificially to answer this type of quesion, i.e., "Are there any x true for all y?"

    Here's MySQL's page on EXISTS and NOT EXISTS, which mentions this technique specifically.

    First, in the innermost SELECT query, you are selecting parts that each store carries. Then, with the first NOT EXISTS clause, you are selecting parts that are NOT carried by each store. Finally, in the outer NOT EXISTS clause, you are selecting stores that returned an empty set for the inner NOT EXISTS clause, meaning they carry every part.

    Here is a SQLFiddle of this query in action.

    A word of warning: if you are working with SQL, it is always good to think and work in sets, and thinking in linear terms like what is about to follow can get you into trouble quickly. Don't make it a habit!

    However, sometimes when trying to figure out a complex query like this, it can help to think of these things as loops.

    So, taking the data in the fiddle as an example, we have:

    suppliers:
    sid, name
    9, 'AAA'
    8, 'BBB'
    7, 'CCC'
    
    parts:
    pid, name
    1, 'wood'
    2, 'stone'
    3, 'paper'
    
    catalog:
    cid, pid, sid
    1,1,9
    2,2,9
    3,1,8
    4,1,7
    5,2,7
    6,3,7
    

    So with this data, AAA carries wood and stone, BBB only carries wood, and CCC carries wood, stone, and paper.

    Now let's step through the query row by row. We are selecting from suppliers, and we are deciding which rows to include in the result set, so start with the first row in suppliers: 9,'AAA'. We'll call this row S temporarily. We'll only include this row if there is nothing in the inner result set, so lets take a look at that.

    suppliers:
    sid, name
    S => 9, 'AAA'
         8, 'BBB'
         7, 'CCC'
    

    This result set is selecting from parts, and we are going to go through it row by row. S still equals 9,'AAA' while we do this. So start with the first row in parts: 1,'wood'. We'll call this row P for now. We'll only include this row in this first inner result set if there is nothing in the next level of result set, so let's move there. Remember that S = 9,'AAA' and P = 1,'wood'.

    suppliers:
    sid, name
    S => 9, 'AAA'
         8, 'BBB'
         7, 'CCC'
    
    parts:
    pid, name
    P => 1, 'wood'
         2, 'stone'
         3, 'paper'
    

    This innermost query is selecting from 'catalog'. We're looking for any row, we'll call it C, in catalog where C.sid equals S.sid AND C.pid equals P.pid. This means that the current supplier carries the part. We want parts that the current supplier DOESN'T carry, that's why we invert the result set. We know that S's sid is 9, and we know that P's pid is 1. Are there any rows in C that match that? The very first row in C matches that, so we know that this result set is not empty.

    suppliers:
    sid, name
    S => 9, 'AAA'
         8, 'BBB'
         7, 'CCC'
    
    parts:
    pid, name
    P => 1, 'wood'
         2, 'stone'
         3, 'paper'
    
    catalog:
    cid, pid, sid
    C => 1,1,9 --Match found! Don't include P in outer result set
         2,2,9
         3,1,8
         4,1,7
         5,2,7
         6,3,7
    

    Now jump back to the next outermost loop. The inner result set was not empty, so we know that 1,'wood will not be part of this loop's result set. So we move to the next row in parts, 2,'stone'. We update the value of P to equal this row. Should we include this row in the result set? We have to run the inner query again with our new value for P (S has still not changed). So we look for any rows in catalog with sid equal to 9 and pid equal to 2. The second row matches, so there is a result set.

    suppliers:
    sid, name
    S => 9, 'AAA'
         8, 'BBB'
         7, 'CCC'
    
    parts:
    pid, name
         1, 'wood'
    P => 2, 'stone'
         3, 'paper'
    
    catalog:
    cid, pid, sid
         1,1,9 
    C => 2,2,9 --Match found! Don't include P in outer result set
         3,1,8
         4,1,7
         5,2,7
         6,3,7
    

    Jump back to the next outermost loop. The inner result set was not empty, so 2,'stone' will not be part of this loop's result set.

    When we go through all this again for 3,'paper', we find that there are no rows in catalog that have sid = 9 and pid = 3. The innermost result set is empty, so we know to include this value of P in the next outermost loop.

    suppliers:
    sid, name
    S => 9, 'AAA'
         8, 'BBB'
         7, 'CCC'
    
    parts:
    pid, name
         1, 'wood'
         2, 'stone'
    P => 3, 'paper'
    
    catalog:
    cid, pid, sid
         1,1,9 
         2,2,9 
         3,1,8
         4,1,7
         5,2,7
         6,3,7
    C => --No match found, include P in outer result set
    

    We have gone through the entire parts table at this point, so we have our final result set for the second loop, and it is not empty, which means that we found a part that S does not carry, so we know that we cannot include the current value of S in our final outer loop's result set.

    So we move on to the next row in suppliers and start the whole process over:

    suppliers:
    sid, name
         9, 'AAA'
    S => 8, 'BBB'
         7, 'CCC'
    

    Once we get to S = 7,'CCC' we will go through all of these loops, and a match will be found in the inner loop for every value of P that is supplied, which means that that second loop will have an empty set. We couldn't find any parts that supplier doesn't carry, so that value of S is added to the result set, meaning they carry everything!