I have a table zipnames
with four columns: id
, zipcode
, name_fi
and name_sv
. Both of the name_*
columns are nullable. For each possible zipcode
, I might have up to four rows: both names present, only name_fi
present, only name_sv
present, or both null.
How can I write a SELECT
for this table so that I can get only one row back for each zipcode
, prioritizing rows with both languages present, then rows with only one language, and only returning nulls for both languages if we really don't have a row with either name?
If it makes a difference, I'm working with PostgreSQL v14, and using SQLAlchemy in Python. Feel free to answer either in plain SQL or as SQLAlchemy code.
Edit. Since it was requested, here is a sample of the data in the table:
| id | zipcode | name_fi | name_sv |
|----|---------|----------|-------------|
| 1 | 00000 | | |
| 2 | 00100 | HELSINKI | HELSINGFORS |
| 3 | 00100 | HELSINKI | |
| 4 | 00100 | | HELSINGFORS |
| 5 | 00100 | | |
| 6 | 50100 | MIKKELI | |
| 7 | 50100 | | |
| 8 | 64250 | | PJELAX |
And here is what I would like the query to return:
| id | zipcode | name_fi | name_sv |
|----|---------|----------|-------------|
| 1 | 00000 | | |
| 2 | 00100 | HELSINKI | HELSINGFORS |
| 6 | 50100 | MIKKELI | |
| 8 | 64250 | | PJELAX |
You can use DISTINCT ON
combined with ORDER BY
.
Thus all other rows except the "most important" one per zipcode will be excluded:
SELECT
DISTINCT ON (zipcode)
id, zipcode,
name_fi,
name_sv
FROM
zipnames
ORDER BY
zipcode,
name_fi,
name_sv;
NULL
values will be sorted last after other rows per default. You can explicit add this option anyway if you prefer it for a better readability:
SELECT
DISTINCT ON (zipcode)
id, zipcode,
name_fi,
name_sv
FROM
zipnames
ORDER BY
zipcode,
name_fi NULLS LAST,
name_sv NULLS LAST;
Both queries will produce following result for your sample data:
id | zipcode | name_fi | name_sv |
---|---|---|---|
1 | 00000 | null | null |
2 | 00100 | HELSINKI | HELSINGFORS |
6 | 50100 | MIKKELI | null |
8 | 64250 | null | PJELAX |
See this db<>fiddle
Note: Above queries will always return one single row per zipcode only. If multiple rows with both names <> NULL
appear having the same zipcode, only the first one according to the ORDER BY
clause will be fetched. If this is not intended, please explain what to do in this case.