I'm working with multiple tables in Postgres that look similar to this:
create table samples
(sampleid,Location,SampleDate)as values
(1, 'A', 'Jan 1')
,(2, 'B', 'Mar 5')
,(3, 'A', 'Feb 17');
create table mature_organisms
(SampleID,SpeciesCode,Count)as values
(1, 'x', 100)
,(1, 'y', 23)
,(1, 'z', 53);
create table juvenile_organisms
(SampleID,SpeciesCode,Count)as values
(1, 'x', 20)
,(1, 'w', 15)
,(1, 'z', 5);
And so on for many samples. Some species appear in both the mature and juvenile tables, some will appear in only one table depending on the sample.
What I want to output is a table that looks something like this:
SampleID | Location | SampleDate | species | mature_count | juvenile_count |
---|---|---|---|---|---|
1 | A | Jan 1 | w | 15 | |
1 | A | Jan 1 | x | 100 | 20 |
1 | A | Jan 1 | y | 23 | |
1 | A | Jan 1 | z | 53 | 5 |
Can this be done without ending up with duplicated columns?
The best I could do was
SELECT m.sampleid
, j.sampleid
, s.location
, s.SampleDate
, m.speciescode
, j.speciescode
, m.count as mcount
, j.count as count
FROM samples s
RIGHT JOIN mature_organisms m
on s.sampleid=m.sampleid
FULL JOIN juvenile_organisms j
on m.sampleid=j.sampleid
AND m.speciescode=j.speciescode;
Which gives me two sampleid columns and two speciescode columns:
sampleid | sampleid | location | sampledate | speciescode | speciescode | mcount | count |
---|---|---|---|---|---|---|---|
null | 1 | null | null | null | w | null | 15 |
1 | 1 | A | Jan 1 | x | x | 100 | 20 |
1 | null | A | Jan 1 | y | null | 23 | null |
1 | 1 | A | Jan 1 | z | z | 53 | 5 |
Can this be done without ending up with duplicated columns? If I specify only
sampleid
orspeciescode
from only 1 table, I end up with blank values in that column if the specified table has no data but the other table does.
For the sample id, you can just refer to s.sampleid
from the samples table, not the organism tables which were joined against it.
For the species code, you can use the COALSCE
function which ignores NULL
values ("blanks") and can get you the value from either table.
Also you'll need to change the join order so that the samples table is joined last, against either the juvenile or mature organisms:
Parentheses can be used around
JOIN
clauses to control the join order. In the absence of parentheses,JOIN
clauses nest left-to-right.
SELECT
s.sampleid,
s.location,
s.date,
COALESCE(m.speciescode, j.speciescode) AS species,
m.count as "mature count",
j.count as "juvenile count"
FROM samples s
JOIN (
mature_organisms m
FULL OUTER JOIN juvenile_organisms j ON m.sampleid=j.sampleid AND m.speciescode=j.speciescode
) ON s.sampleid=COALESCE(m.sampleid, j.sampleid);
Another approach is to use the USING
clause for the join conditions, which gets rid of the duplicate columns automatically:
SELECT sampleid, location, speciescode AS species, m.count as "mature count", j.count as "juvenile count"
FROM samples JOIN (mature_organisms m FULL OUTER JOIN juvenile_organisms j USING (sampleid, speciescode)) USING (sampleid);