sql postgresql aggregate-functions aggregate-filter

Aggregate the same column in multiple different ways

I am trying to get an array of the categories associated with each product, and also the top-level parent category of each product in a separate column. By my logic, that means looking at the same categories that go into the array, but selecting only the one where parent_id is NULL, which should return a single value and one record per id.

I really don't know the best way to structure this query. What I have kind of works, but it shows NULL in the parent category column for categories that do have a parent ID, and it produces a second record for each product because I am forced to put that column in the GROUP BY. Basically, I don't think I am doing this in the correct or most efficient way.

Desired result:

+----+----------------+------------------+------------------------------------------------+------------------+
| id | name           | category_ids     | category_names                                 | parent_category  |
+----+----------------+------------------+------------------------------------------------+------------------+
| 1  | Product Name 1 | {111,222,333}    | {Electronics, computers, computer accessories} | Electronics      |
+----+----------------+------------------+------------------------------------------------+------------------+

My current query (which is not ideal):

select p.id, 
p.name, 
array_agg(category_id) as category_ids,
regexp_replace(array_agg(c.name)::text,'"|''','','gi') as category_names,
c1.name as parent_category
from products p
join product_categorizations pc  on pc.product_id = p.id
join categories c  on pc.category_id = c.id
full outer join (
   select name, id from categories
   where parent_id is null and name is not null
   ) c1 on c.id = c1.id
group by 1,2,5;

+----+----------------+------------------+-----------------------------------+------------------+
| id | name           | category_ids     | category_names                    | parent_category  |
+----+----------------+------------------+-----------------------------------+------------------+
| 1  | Product Name 1 | {111}            | {Electronics}                     | Electronics      |
+----+----------------+------------------+-----------------------------------+------------------+
| 1  | Product Name 1 | {222,333}        | {computers, computer accessories} | NULL             |
+----+----------------+------------------+-----------------------------------+------------------+

Solution

Replace the FULL JOIN with an aggregate FILTER clause:

SELECT p.id
     , p.name
     , array_agg(pc.category_id) AS category_ids
     , string_agg(c.name, ', ')  AS category_names  -- regexp_replace .. ?
     , min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
FROM   products                p
JOIN   product_categorizations pc ON pc.product_id = p.id
JOIN   categories              c  ON pc.category_id = c.id
GROUP  BY p.id;

See:

Aggregate columns with additional (distinct) filters

(Why would you add AND name IS NOT NULL? min() ignores NULL values anyway.)
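To try the FILTER clause in isolation, here is a minimal sketch with temporary tables. Table and column names follow the question; the column types, the sample rows, and the parent chain (333 under 222 under 111) are assumptions made for illustration, as is the ORDER BY inside the aggregates, which is only there to make the element order deterministic.

```sql
-- Hypothetical minimal schema: names follow the question, types are assumed.
-- Temp tables shadow same-named permanent tables for the current session only.
CREATE TEMP TABLE products   (id int PRIMARY KEY, name text);
CREATE TEMP TABLE categories (id int PRIMARY KEY, name text, parent_id int);
CREATE TEMP TABLE product_categorizations (product_id int, category_id int);

-- Assumed sample data, modeled on the desired result in the question
INSERT INTO products   VALUES (1, 'Product Name 1');
INSERT INTO categories VALUES
   (111, 'Electronics'         , NULL)  -- top-level: no parent
 , (222, 'computers'           , 111)
 , (333, 'computer accessories', 222);
INSERT INTO product_categorizations VALUES (1, 111), (1, 222), (1, 333);

SELECT p.id
     , p.name  -- allowed without listing it in GROUP BY because p.id is the primary key
     , array_agg(pc.category_id ORDER BY pc.category_id) AS category_ids
     , string_agg(c.name, ', ' ORDER BY pc.category_id)  AS category_names
       -- only rows with parent_id IS NULL are fed to min()
     , min(c.name) FILTER (WHERE c.parent_id IS NULL)    AS parent_category
FROM   products                p
JOIN   product_categorizations pc ON pc.product_id = p.id
JOIN   categories              c  ON pc.category_id = c.id
GROUP  BY p.id;
```

With this sample data, the query returns a single row for product 1, with parent_category 'Electronics'. The FILTER clause restricts only the input of that one aggregate, so the other aggregates still see all three category rows and no extra GROUP BY column is needed.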

When aggregating all products, and as long as referential integrity is enforced, this should be a bit faster:

SELECT p.name, pc.*
FROM   products p
JOIN  (
   SELECT pc.product_id AS id
        , array_agg(pc.category_id) AS category_ids
        , string_agg(c.name, ', ')  AS category_names
        , min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
   FROM   product_categorizations pc
   JOIN   categories              c  ON pc.category_id = c.id
   GROUP  BY 1
   ) pc  USING (id);

The point being that products is joined only after the rows have been aggregated.

Aside: "name" is not a very helpful column name. Related:

How to implement a many-to-many relationship in PostgreSQL?