I have been asked to solve this question:
Say that a big city is defined as a
place
of typecity
with a population of at least 100,000. Write an SQL query that returns the scheme(state_name,no_big_city,big_city_population)
ordered bystate_name
, listing those states which have either (a) at least five big cities or (b) at least one million people living in big cities. The columnstate_name
is thename
of thestate
,no_big_city
is the number of big cities in the state, andbig_city_population
is the number of people living in big cities in the state.
Now, as far as I can see, the following query returns correct results:
SELECT state.name AS state_name
, COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) AS no_big_city
, SUM (CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) AS big_city_population
FROM state
JOIN place ON state.code = place.state_code
GROUP BY state_name
HAVING COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) >= 5
OR SUM (CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) >= 1000000
ORDER BY state_name;
The lengthy aggregate function expressions appear twice.
Is there any way to avoid code duplication while preserving functionality?
I have already tried using the alias, but I just get a "column does not exist" error.
An output column's name can be used to refer to the column's value in
ORDER BY
andGROUP BY
clauses, but not in theWHERE
orHAVING
clauses; there you must write out the expression instead.
Bold emphasis mine.
You can avoid spelling out lengthy expressions repeatedly with a subquery or CTE:
SELECT state_name, no_big_city, big_city_population
FROM (
SELECT s.name AS state_name
, count(*) FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS no_big_city
, sum(population) FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS big_city_population
FROM state s
JOIN place p ON s.code = p.state_code
GROUP BY s.name -- can be input column name as well, best schema-qualified to avoid ambiguity
) sub
WHERE no_big_city >= 5 OR big_city_population >= 1000000
ORDER BY state_name;
While being at it, I simplified with the aggregate FILTER
clause. See:
However, I suggest this simpler and faster query to begin with:
SELECT s.state_name, p.no_big_city, p.big_city_population
FROM state s
JOIN (
SELECT state_code AS code -- alias just to simplify join
, count(*) AS no_big_city
, sum(population) AS big_city_population
FROM place
WHERE type = 'city'
AND population >= 100000
GROUP BY 1 -- can be ordinal number referencing position in SELECT list
HAVING count(*) >= 5 OR sum(population) >= 1000000 -- simple expressions now
) p USING (code)
ORDER BY 1; -- can also be ordinal number
In GROUP BY
and ORDER BY
you can also reference output columns in the SELECT
list by ordinal position. Can be useful to avoid confusion about the scope of input and output column names.