sqlamazon-redshifthavingrolling-computationdistinct-values

I need to SQL query a rolling 90-day distinct count of ids in a table, only counting ids that appear 3 or more times (at least 3 interactions)


I have an 'interactions' table that basically has a row for each time an id interacted. So there can be many repeated ids:

id dateinteracted
1651 September 15, 2017, 3:07 PM
1366 June 28, 2017, 8:32 PM
1366 July 3, 2017, 6:26 PM
1366 July 3, 2017, 6:22 PM
1212 August 24, 2017, 1:05 PM
1366 September 18, 2017, 9:15 AM

I need to pull a rolling 90-day distinct count of ids, but only count ids that appear 3 or more times in that 90 days (high engagement). I'm having trouble converting a 90 day lookback query for 1 day into a rolling one that pulls for each passing day of the year.

(I'm pulling data from Redshift btw) First I wrote a sql query to get a distinct count of IDs from the last 90 days where the id interacted at least 3 times to be counted:

WITH cte AS
(
SELECT
id
,COUNT(id) "Count"
FROM interactions
WHERE 1=1
    AND dateinteracted >= (current_date - 90)
    AND dateinteracted < current_date
GROUP BY id
HAVING COUNT(id) >= 3
)

SELECT
COUNT(DISTINCT(id)) "DistinctCount"
FROM cte

I then tried to convert that to pull on a rolling basis. i.e. for each and every day I want the distinct count from the prior 90 days where engagement >=3. (a window function with RANGE is not available to me due to pulling this data from Amazon Redshift where it isn't supported):

WITH cte AS (
SELECT
    A.dateinteracted,
    A.id,
        (SELECT
            COUNT(B.id)
        FROM interactions AS B
        WHERE 1=1
        AND B.id = A.id
        AND B.dateinteracted >= (A.dateinteracted - 90)
        AND B.dateinteracted < A.dateinteracted
        ) AS RollingCount
FROM interactions AS A
)

SELECT
CAST(dateinteracted as Date) "Date",
COUNT( DISTINCT(id)) "DistinctCount"
FROM cte
WHERE 1=1
    AND RollingCount >= 3
GROUP BY 
    CAST(dateinteracted as Date)
ORDER BY 1 DESC

This code seems close, but looks to be undercounting. I'm not positive but I think the issue is that for each day, it is only looking back 90 days for ids that happen to come in on that day, and not all ids from the previous 90 days from that date. Any help would be massive thank you!


Solution

  • I will suppose solution without window function. Idea is self join table and count records with for each record and than for each date.

    create table interactions( id int,  dateinteracted datetime);
    insert into interactions values
     (1401,'2017-05-29 09:16:00')
    ,(1401,'2017-06-28 09:11:00')
    ,(1501,'2017-06-02 09:16:00')
    ,(1366,'2017-07-03 18:26:00')
    ,(1366,'2017-07-03 18:22:00')
    ,(1366,'2017-07-28 20:32:00')
    ,(1401,'2017-07-28 09:11:00')
    ,(1212,'2017-08-24 13:05:00')
    ,(1651,'2017-09-15 15:07:00')
    ,(1366,'2017-09-18 09:15:00')
    ,(1366,'2017-10-02 09:16:00')
    ,(1401,'2017-10-02 09:16:00')
    ;
    with gr as(
    select t1.id,t1.dateinteracted
      ,count(*) cnt
    from interactions t1
    left join interactions t2 
      on t1.id=t2.id 
      and t2.dateinteracted<t1.dateinteracted 
      and t2.dateinteracted>=(t1.dateinteracted-90)
    group by t1.id,t1.dateinteracted
    )
    select cast(dateinteracted as date) cdate
      ,sum(case when cnt>=2 then 1 else 0 end) cnt_in_day
    from gr
    where cnt>=2
    group by cast(dateinteracted as date)
    

    Cte result

    id dateinteracted cnt
    1212 2017-08-24 13:05:00.000 1
    1366 2017-07-03 18:22:00.000 1
    1366 2017-07-03 18:26:00.000 1
    1366 2017-07-28 20:32:00.000 2
    1366 2017-09-18 09:15:00.000 3
    1366 2017-10-02 09:16:00.000 2
    1401 2017-05-29 09:16:00.000 1
    1401 2017-06-28 09:11:00.000 1
    1401 2017-07-28 09:11:00.000 2
    1401 2017-10-02 09:16:00.000 1
    1501 2017-06-02 09:16:00.000 1
    1651 2017-09-15 15:07:00.000 1

    Query result

    cdate cnt_in_day
    2017-07-28 2
    2017-09-18 1
    2017-10-02 1

    Or full result (without condition where cnt>=2)

    cdate cnt_in_day
    2017-05-29 0
    2017-06-02 0
    2017-06-28 0
    2017-07-03 0
    2017-07-28 2
    2017-08-24 0
    2017-09-15 0
    2017-09-18 1
    2017-10-02 1

    Example here Although query test example is for SQL Server, nothing specific is used, i think.

    I'll update answer. New example counts activity with lag 90 days for every day in calendar. Calendar (dates) is list all of days from min(dateinteracted) to max(dateinteracted). You may use other calendar, if if necessary.
    Next cte (act_in_day) joins all activities to calendar row with interactions on conditions dateinteracted from today to previous 90 days.

    with dates as(
      select 1 lvl,cast(min(cast(dateinteracted as date)) as datetime) as dt
          ,cast(max(cast(dateinteracted as date)) as datetime) maxdt from interactions
      union all
      select lvl+1,dt+1,maxdt from dates where dt+1<=maxdt --and lvl<10
    )
    ,act_in_day as(
    select * 
    from dates d
    left join interactions i 
      on   cast(d.dt as date)>=cast(i.dateinteracted as date)
      and cast(d.dt as date)<cast((i.dateinteracted+90) as date)
    )
    ,id_act_in_day as(
    select id,dt
      ,count(*) cnt
    from act_in_day
    group by id,dt
    )
    ,act_by_day as(
    select dt
      ,sum(case when cnt>=3 then 1 else 0 end) cnt_in_day
    from id_act_in_day
    --where cnt>=2
    group by dt
    )
    select * from act_by_day
    order by dt
    option (maxrecursion 300);
      
    

    New example here