sqlpostgresqlcaseaggregate-functionshaving

Scope of a column alias in a SELECT with GROUP BY


I have been asked to solve this question:

Say that a big city is defined as a place of type city with a population of at least 100,000. Write an SQL query that returns the scheme (state_name,no_big_city,big_city_population) ordered by state_name, listing those states which have either (a) at least five big cities or (b) at least one million people living in big cities. The column state_name is the name of the state, no_big_city is the number of big cities in the state, and big_city_population is the number of people living in big cities in the state.

Now, as far as I can see, the following query returns correct results:

SELECT state.name AS state_name
     , COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) AS no_big_city
     , SUM  (CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) AS big_city_population
FROM state
JOIN place ON state.code = place.state_code
GROUP BY state_name
HAVING COUNT(CASE WHEN place.type = 'city' AND place.population >= 100000 THEN 1 ELSE NULL END) >= 5
    OR SUM  (CASE WHEN place.type = 'city' AND place.population >= 100000 THEN place.population ELSE NULL END) >= 1000000
ORDER BY state_name;

The lengthy aggregate function expressions appear twice.
Is there any way to avoid code duplication while preserving functionality?

I have already tried using the alias, but I just get a "column does not exist" error.


Solution

  • The manual clarifies:

    An output column's name can be used to refer to the column's value in ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses; there you must write out the expression instead.

    Bold emphasis mine.

    You can avoid spelling out lengthy expressions repeatedly with a subquery or CTE:

    SELECT state_name, no_big_city, big_city_population
    FROM  (
       SELECT s.name AS state_name
            , count(*)        FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS no_big_city
            , sum(population) FILTER (WHERE p.type = 'city' AND p.population >= 100000) AS big_city_population
       FROM   state s
       JOIN   place p ON s.code = p.state_code
       GROUP  BY s.name  -- can be input column name as well, best schema-qualified to avoid ambiguity
       ) sub
    WHERE  no_big_city >= 5 OR big_city_population >= 1000000
    ORDER  BY state_name;
    

    While being at it, I simplified with the aggregate FILTER clause. See:

    However, I suggest this simpler and faster query to begin with:

    SELECT s.state_name, p.no_big_city, p.big_city_population
    FROM   state s
    JOIN  (
       SELECT state_code      AS code  -- alias just to simplify join
            , count(*)        AS no_big_city
            , sum(population) AS big_city_population
       FROM   place
       WHERE  type = 'city'
       AND    population >= 100000
       GROUP  BY 1  -- can be ordinal number referencing position in SELECT list
       HAVING count(*) >= 5 OR sum(population) >= 1000000  -- simple expressions now
       ) p USING (code)
    ORDER  BY 1;    -- can also be ordinal number
    

    In GROUP BY and ORDER BY you can also reference output columns in the SELECT list by ordinal position. Can be useful to avoid confusion about the scope of input and output column names.