I am trying to get an array of categories associated with each product and then also get the top-level parent category of each product in another column, which by my logic is finding the same values for the categories array, but only selecting where parent_id is NULL
which should pull back only one value and 1 record per id.
I really don't know the best way to structure this query. What I have kind of works, but it also shows NULL values in the parent category column for the categories that do have a parent ID and makes a second record for each product because I am forced to put it in the group by. Basically, I think I am not doing this in the correct or most efficient way.
Desired result:
+----+----------------+------------------+------------------------------------------------+------------------+
| id | name | category_ids | category_names | parent_category |
+----+----------------+------------------+------------------------------------------------+------------------+
| 1 | Product Name 1 | {111,222,333} | {Electronics, computers, computer accessories} | Electronics |
+----+----------------+------------------+------------------------------------------------+------------------+
My current query (which is not ideal):
select p.id,
p.name,
array_agg(category_id) as category_ids,
regexp_replace(array_agg(c.name)::text,'"|''','','gi') as category_names,
c1.name as parent_category
from products p
join product_categorizations pc on pc.product_id = p.id
join categories c on pc.category_id = c.id
full outer join (
select name, id from categories
where parent_id is null and name is not null
) c1 on c.id = c1.id
group by 1,2,5;
+----+----------------+------------------+-----------------------------------+------------------+
| id | name | category_ids | category_names | parent_category |
+----+----------------+------------------+-----------------------------------+------------------+
| 1 | Product Name 1 | {111} | {Electronics} | Electronics |
+----+----------------+------------------+-----------------------------------+------------------+
| 1 | Product Name 1 | {222,333} | {computers, computer accessories} | NULL |
+----+----------------+------------------+-----------------------------------+------------------+
Replace the FULL JOIN
with an aggregate FILTER
clause:
SELECT p.id
, p.name
, array_agg(pc.category_id) AS category_ids
, string_agg(c.name, ', ') AS category_names -- regexp_replace .. ?
, min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
FROM products p
JOIN product_categorizations pc ON pc.product_id = p.id
JOIN categories c ON pc.category_id = c.id
GROUP BY p.id;
See:
(Why would you add AND name IS NOT NULL
? Either way, min()
ignores NULL
values anyway.)
While aggregating all products, and while referential integrity is enforced, this should be a bit faster:
SELECT p.name, pc.*
FROM products p
JOIN (
SELECT pc.product_id AS id
, array_agg(pc.category_id) AS category_ids
, string_agg(c.name, ', ') AS category_names
, min(c.name) FILTER (WHERE c.parent_id IS NULL) AS parent_category
FROM product_categorizations pc
JOIN categories c ON pc.category_id = c.id
GROUP BY 1
) pc USING (id);
The point being that product
only joins after aggregating rows.
Aside: "name" is not a very helpful column name. Related: