sqlamazon-redshiftamazon-redshift-serverless

Setting Result rows dynamically in dependency of cell content


I have to tables in Redshift, articles and clicks:

articles

| articleID | authorID |
| 100       |   2      |
| 101       |   2      |
| 102       |   6      |
| 103       |   7      |
| 104       |   2      |

clicks

|articleID  | category |
|100        | "mail"   |
|100        | "mail"   |
|100        | "rss"    |
|101        | "rss"    |
|101        | "mail"   |
|101        | "app"    |
|101        | "app"    |

Now I want to aggregate for each articleID the number of clicks per category as well as the total amount of clicks with this query

SELECT clicks.articleID,
ANY_VALUE(article.authorID) AS authorID,

COUNT(CASE WHEN clicks.category ='rss' THEN 1 END) as rss,
COUNT(CASE WHEN clicks.category ='mail' THEN 1 END) as mail,
COUNT(CASE WHEN clicks.category ='app' THEN 1 END) as app,

COUNT(clicks.articleID) AS total
FROM clicks
INNER JOIN articles
ON clicks.articleID = articles.articleID

GROUP BY clicks.articleID

It results in

| articleID | authorID  | rss   | mail  | app   | total |
| 100       | 2         | 1     | 2     | 0     |   3   |
| 101       | 2         | 1     | 1     | 2     |   4   |

which is fine. Now I want to set the category colums dynamically. So when an additional category in "clicks" appears, e.g. "facebook", I do not have to alter the query for getting

| articleID | authorID  | rss   | mail  | app   | facebook  | total |
| 100       | 2         | 1     | 2     | 0     |   1       |   4   |
| 101       | 2         | 1     | 1     | 2     |   0       |   4   |

I found this one

WITH categories AS (SELECT category from clicks GROUP BY category)

as a prefix to the query But how do I access categories in the main query, something like

WITH categories AS (SELECT category from clicks GROUP BY category)

SELECT clicks.articleID,
ANY_VALUE(article.authorID) AS authorID,

--pseudocode
FOREACH(categories AS category)
    COUNT(CASE WHEN clicks.category =$category THEN 1 END) as $category
ENDFOREACH
--//pseudocode


COUNT(clicks.articleID) AS total
FROM clicks
INNER JOIN articles
ON clicks.articleID = articles.articleID

GROUP BY clicks.articleID

?

Of course I could first query all categories and then run a query for each category, but I hope that there is a more elegant way


Solution

  • Two things - first there is now PIVOT functionality in Redshift so you can don't need all those case statements. So it would look like:

    select * 
    from ( 
      select category, c.articleid, authorid 
      from clicks c
      join articles a on c.articleid = a.articleid) 
        PIVOT ( count(category) for category in ('rss', 'mail', 'app') );
    

    The second thing is that you want dynamic SQL. You see the list of categories is "baked" into the SQL either as the PIVOT list or the GROUP BY list or ... this isn't something that can be made dynamic inside the database. So you will need to have something gather the list of categories and then issue the SQL for the final query. Lambda can be this something but so can many other tools. Like so:

    select '\'' || listagg(distinct category, '\',\'') || '\'' from clicks;
    

    then something takes this list and creates the query from a template such as:

    select * 
    from ( 
      select category, c.articleid, authorid 
      from clicks c
      join articles a on c.articleid = a.articleid) 
        PIVOT ( count(category) for category in (<INSERT RESULT FROM ABOVE HERE>) );
    

    Sorry this is the best you can do. Queries are compiled BEFORE any data is seen.