I am working with Netezza SQL. I have the following table:
CREATE TABLE MY_TABLE (
id VARCHAR(50),
year VARCHAR(50)
);
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2010');
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('123', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('124', '2013');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('125', '2013');
INSERT INTO MY_TABLE (id, year)
VALUES ('127', '2011');
INSERT INTO MY_TABLE (id, year)
VALUES ('127', '2012');
INSERT INTO MY_TABLE (id, year)
VALUES ('127', '2015');
INSERT INTO MY_TABLE (id, year)
VALUES ('126', '2019');
My Question: I am trying to find out the number of times each combination of years appears. The final result would look like this:
combination freq
1 2011,2012,2015 1
2 2011,2012,2013 2
3 2010, 2011, 2012 1
4 2019 1
Here is my attempt at doing this in Netezza:
WITH CTE AS (
SELECT id, year, ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS rn
FROM MY_TABLE
),
CTE2 AS (
SELECT id, MAX(rn) AS max_rn
FROM CTE
GROUP BY id
),
CTE3 AS (
SELECT CTE2.id, CTE.year, CTE.rn, CTE2.max_rn
FROM CTE2
JOIN CTE ON CTE2.id = CTE.id
),
CTE4 AS (
SELECT id,
MAX(CASE WHEN rn = 1 THEN year END) ||
MAX(CASE WHEN rn = 2 THEN ',' || year END) ||
MAX(CASE WHEN rn = 3 THEN ',' || year END) ||
MAX(CASE WHEN rn = 4 THEN ',' || year END) ||
MAX(CASE WHEN rn = 5 THEN ',' || year END) ||
MAX(CASE WHEN rn = 6 THEN ',' || year END) ||
MAX(CASE WHEN rn = 7 THEN ',' || year END) ||
MAX(CASE WHEN rn = 8 THEN ',' || year END) ||
MAX(CASE WHEN rn = 9 THEN ',' || year END) ||
MAX(CASE WHEN rn = 10 THEN ',' || year END)
AS combination
FROM CTE3
GROUP BY id
)
SELECT combination, COUNT(*) AS freq
FROM CTE4
GROUP BY combination
ORDER BY freq DESC;
But the calculations are completely wrong:
combination freq
1 <NA> 94
2 2010,2012,2013,2014,2015,2016,2017,2018,2019,2020 1
3 2010,2011,2012,2013,2014,2015,2016,2017,2019,2020 1
4 2010,2011,2012,2013,2014,2016,2017,2018,2019,2020 1
5 2010,2011,2012,2014,2015,2016,2017,2018,2019,2020 1
6 2010,2011,2013,2014,2015,2016,2017,2018,2019,2020 1
7 2010,2011,2012,2013,2014,2015,2016,2017,2018,2019 1
Can someone please show me how to fix this?
Thanks!
Note 1: I am using rn = 10 to account for the min/max range of years (which in my case is 10). I only have years 2010, 2011, ... 2020 in my dataset. Based on this information - is it possible to write a manual solution for individual years using UNION ALL style statements?
Note 2: Unfortunately, there is no GROUP_CONCAT function in Netezza - otherwise the following code would have worked:
Here is a sample:
SELECT GROUP_CONCAT(year, ',' ORDER BY year) AS combination, COUNT(*) AS freq
FROM my_data
GROUP BY id
ORDER BY freq DESC;
Source: Counting Number of Unique Column Values Per Group
agg <- aggregate(year ~ id, my_table, paste, collapse = ", ")
final = as.data.frame(table(agg$year))
Since Netezza lacks LISTAGG, STRING_AGG, and GROUP_CONCAT, this is a bit more challenging.
Reading that Netezza software was based on PostgreSQL 7.2, I created this against 9.6 -- the oldest available on dbfiddle.uk.
CREATE TABLE MY_TABLE (
id VARCHAR(50),
year VARCHAR(50)
);
INSERT INTO MY_TABLE (id, year) VALUES ('123', '2010');
INSERT INTO MY_TABLE (id, year) VALUES ('123', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('123', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('124', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('124', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('124', '2013');
INSERT INTO MY_TABLE (id, year) VALUES ('125', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('125', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('125', '2013');
INSERT INTO MY_TABLE (id, year) VALUES ('127', '2011');
INSERT INTO MY_TABLE (id, year) VALUES ('127', '2012');
INSERT INTO MY_TABLE (id, year) VALUES ('127', '2015');
INSERT INTO MY_TABLE (id, year) VALUES ('126', '2019');
WITH CTE AS (
SELECT id, year, ROW_NUMBER() OVER (PARTITION BY id ORDER BY year) AS rn
FROM MY_TABLE
),
CTE3 AS (
select id
, year
, rn
, max(rn) over (partition by id) as MaxRN
from cte
),
CTE4 AS (
SELECT id
, coalesce(max(CASE WHEN rn = 1 THEN year END), '') ||
coalesce(max(CASE WHEN rn = 2 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 3 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 4 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 5 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 6 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 7 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 8 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 9 THEN ',' || year END), '') ||
coalesce(max(CASE WHEN rn = 10 THEN ',' || year END), '')
AS combination
FROM CTE3
GROUP BY id
)
SELECT combination
, count(id) as freq
FROM CTE4
GROUP BY combination
COALESCE()
is needed because SELECT 'a' || NULL
returns NULL. Without this the result would be
combination | freq |
---|---|
null | 5 |