sqlpostgresqljoin

Combining multiple (more than 2) tables on multiple columns


I'm working with multiple tables in Postgres that look similar to this:

create table samples
(sampleid,Location,SampleDate)as values
 (1,    'A',    'Jan 1')
,(2,    'B',    'Mar 5')
,(3,    'A',    'Feb 17');

create table mature_organisms
(SampleID,SpeciesCode,Count)as values
 (1,    'x',    100)
,(1,    'y',    23)
,(1,    'z',    53);

create table juvenile_organisms
(SampleID,SpeciesCode,Count)as values
 (1,    'x',    20)
,(1,    'w',    15)
,(1,    'z',    5);

And so on for many samples. Some species appear in both the mature and juvenile tables, some will appear in only one table depending on the sample.

What I want to output is a table that looks something like this:

SampleID Location SampleDate species mature_count juvenile_count
1 A Jan 1 w 15
1 A Jan 1 x 100 20
1 A Jan 1 y 23
1 A Jan 1 z 53 5

Can this be done without ending up with duplicated columns?

The best I could do was

SELECT m.sampleid
     , j.sampleid
     , s.location
     , s.SampleDate
     , m.speciescode
     , j.speciescode
     , m.count as mcount
     , j.count as count
FROM samples s
RIGHT JOIN mature_organisms m 
  on s.sampleid=m.sampleid
FULL JOIN juvenile_organisms j 
  on m.sampleid=j.sampleid 
 AND m.speciescode=j.speciescode;

Which gives me two sampleid columns and two speciescode columns:

sampleid sampleid location sampledate speciescode speciescode mcount count
null 1 null null null w null 15
1 1 A Jan 1 x x 100 20
1 null A Jan 1 y null 23 null
1 1 A Jan 1 z z 53 5

Solution

  • Can this be done without ending up with duplicated columns? If I specify only sampleid or speciescode from only 1 table, I end up with blank values in that column if the specified table has no data but the other table does.

    For the sample id, you can just refer to s.sampleid from the samples table, not the organism tables which were joined against it.

    For the species code, you can use the COALSCE function which ignores NULL values ("blanks") and can get you the value from either table.

    Also you'll need to change the join order so that the samples table is joined last, against either the juvenile or mature organisms:

    Parentheses can be used around JOIN clauses to control the join order. In the absence of parentheses, JOIN clauses nest left-to-right.

    SELECT
      s.sampleid,
      s.location,
      s.date,
      COALESCE(m.speciescode, j.speciescode) AS species,
      m.count as "mature count",
      j.count as "juvenile count"
    FROM samples s
    JOIN (
      mature_organisms m
      FULL OUTER JOIN juvenile_organisms j ON m.sampleid=j.sampleid AND m.speciescode=j.speciescode
    ) ON s.sampleid=COALESCE(m.sampleid, j.sampleid);
    

    Another approach is to use the USING clause for the join conditions, which gets rid of the duplicate columns automatically:

    SELECT sampleid, location, speciescode AS species, m.count as "mature count", j.count as "juvenile count"
    FROM samples JOIN (mature_organisms m FULL OUTER JOIN juvenile_organisms j USING (sampleid, speciescode)) USING (sampleid);