sqlpostgresqlaggregate-functionsaggregate-filter

Aggregate the same column in multiple different ways


I am trying to get an array of categories associated with each product and then also get the top-level parent category of each product in another column, which by my logic is finding the same values for the categories array, but only selecting where parent_id is NULL which should pull back only one value and 1 record per id.

I really don't know the best way to structure this query. What I have kind of works, but it also shows NULL values in the parent category column for the categories that do have a parent ID and makes a second record for each product because I am forced to put it in the group by. Basically, I think I am not doing this in the correct or most efficient way.

Desired result:

+----+----------------+------------------+------------------------------------------------+------------------+
| id | name           | category_ids     | category_names                                 | parent_category  |
+----+----------------+------------------+------------------------------------------------+------------------+
| 1  | Product Name 1 | {111,222,333}    | {Electronics, computers, computer accessories} | Electronics      |
+----+----------------+------------------+------------------------------------------------+------------------+

My current query (which is not ideal):

select p.id, 
p.name, 
array_agg(category_id) as category_ids,
regexp_replace(array_agg(c.name)::text,'"|''','','gi') as category_names,
c1.name as parent_category
from products p
join product_categorizations pc  on pc.product_id = p.id
join categories c  on pc.category_id = c.id
full outer join (
   select name, id from categories
   where parent_id is null and name is not null
   ) c1 on c.id = c1.id
group by 1,2,5;
+----+----------------+------------------+-----------------------------------+------------------+
| id | name           | category_ids     | category_names                    | parent_category  |
+----+----------------+------------------+-----------------------------------+------------------+
| 1  | Product Name 1 | {111}            | {Electronics}                     | Electronics      |
+----+----------------+------------------+-----------------------------------+------------------+
| 1  | Product Name 1 | {222,333}        | {computers, computer accessories} | NULL             |
+----+----------------+------------------+-----------------------------------+------------------+

Solution

  • Replace the FULL JOIN with an aggregate FILTER clause:

    SELECT p.id
         , p.name
         , array_agg(pc.category_id) AS category_ids
         , string_agg(c.name, ', ')  AS category_names  -- regexp_replace .. ?
         , min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
    FROM   products                p
    JOIN   product_categorizations pc ON pc.product_id = p.id
    JOIN   categories              c  ON pc.category_id = c.id
    GROUP  BY p.id;

    See:

    (Why would you add AND name IS NOT NULL? Either way, min() ignores NULL values anyway.)

    While aggregating all products, and while referential integrity is enforced, this should be a bit faster:

    SELECT p.name, pc.*
    FROM   products p
    JOIN  (
       SELECT pc.product_id AS id
            , array_agg(pc.category_id) AS category_ids
            , string_agg(c.name, ', ')  AS category_names
            , min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
       FROM   product_categorizations pc
       JOIN   categories              c  ON pc.category_id = c.id
       GROUP  BY 1
       ) pc  USING (id);
    

    The point being that product only joins after aggregating rows.

    Aside: "name" is not a very helpful column name. Related: