Tags: sql, snowflake-cloud-data-platform

Use condition in partition by window in SnowFlake


Somewhat related: Filtering a Query based on a Date and Window function in Snowflake

I need to create a query that counts the number of occurrences of an id in a ±90 day window, similar to the query below but as a window function. Is that possible?

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visited once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a huge visitor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
)
select * from (
    select 
    count_if("change" between -90 and 90) over (partition by ID, t1.DATE) "c",
    *
    from fake_data as t1
    left outer join lateral (
        select t1.DATE - t.DATE "change", t.DATE "t_DATE" 
        from fake_data AS t
        where t1.id = t.id and t1.DATE - t.DATE between -90 and 90
     ) as t2
    order by ID, t1.DATE, "change"
)
where "change" = 0;

Result (change and t_DATE are just included for reference):

c  ID  DATE        change  t_DATE
1  1   2022-04-14  0       2022-04-14
2  3   2022-01-13  0       2022-01-13
3  3   2022-03-13  0       2022-03-13
2  3   2022-05-13  0       2022-05-13
2  5   2022-01-01  0       2022-01-01
3  5   2022-02-01  0       2022-02-01
3  5   2022-05-01  0       2022-05-01
3  5   2022-06-01  0       2022-06-01
2  5   2022-08-01  0       2022-08-01
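For reference, those counts can be reproduced outside Snowflake. Here is a small Python sketch of the same ±90-day counting logic (a standalone illustration of mine, not part of the original question):

```python
from datetime import date

# Sample visits from the question: (id, visit date).
visits = [
    (1, date(2022, 4, 14)),
    (3, date(2022, 1, 13)), (3, date(2022, 3, 13)), (3, date(2022, 5, 13)),
    (5, date(2022, 1, 1)), (5, date(2022, 2, 1)), (5, date(2022, 5, 1)),
    (5, date(2022, 6, 1)), (5, date(2022, 8, 1)),
]

def window_counts(rows, days=90):
    """For each (id, d) row, count same-id rows within +/- `days` days of d."""
    return [
        (sum(1 for j, e in rows if j == i and abs((d - e).days) <= days), i, d)
        for i, d in rows
    ]

for c, i, d in window_counts(visits):
    print(c, i, d)
```

This prints the same `c` values as the result table above (each row counts itself, so the minimum is 1).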

This is what I'd like to do, but it doesn't seem like the current row's DATE is available inside the window definition (or that I can alias it):

select
  count_if(DATE - d between -90 and 90) over (partition by id, DATE  as d) as "c",
  id,
  date
from fake_data;

Solution

  • hmm,

    even though it's not exactly what you want, your SQL can be rewritten as:

    select 
        t1.DATE - t.DATE as change
        ,count_if(abs(t1.DATE - t.DATE) <= 90) over (partition by t1.ID, t1.DATE) as c
        ,t1.*
        ,t.date as t_date
    from fake_data as t1
    left join fake_data as t
        on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
    qualify change = 0
    order by t1.ID, t1.DATE, change
    

    But given the join condition is the same as your count_if predicate, that can also be written:

    select 
        t1.DATE - t.DATE as change
        ,count(*) over (partition by t1.ID, t1.DATE) as c
        ,t1.*
        ,t.date as t_date
    from fake_data as t1
    left join fake_data as t
        on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
    qualify change = 0
    order by t1.ID, t1.DATE, change
    

    But given that window functions do not expose a "this row" value for use in the time range the way you wish, you could work around that with a JavaScript UDTF: for each row, build an in-memory set per partition, do the count against that set, emit the results in the finalize step, and then join back to them.
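    To illustrate what such a per-partition, in-memory pass would compute, here is a sketch in Python rather than a JavaScript UDTF (my illustration of the idea, not Snowflake code): sort each id's dates once, then slide a two-pointer window, so each partition costs O(n log n) instead of a quadratic self-comparison.

    ```python
    from collections import defaultdict
    from datetime import date

    def sliding_counts(rows, days=90):
        """rows: list of (id, date). Returns {(id, date): count of same-id
        dates within +/- `days` days}, via a two-pointer sweep per id."""
        by_id = defaultdict(list)
        for i, d in rows:
            by_id[i].append(d)
        out = {}
        for i, dates in by_id.items():
            dates.sort()
            lo = hi = 0
            for d in dates:
                # drop dates more than `days` days before d
                while (d - dates[lo]).days > days:
                    lo += 1
                # take in dates up to `days` days after d
                while hi < len(dates) and (dates[hi] - d).days <= days:
                    hi += 1
                out[(i, d)] = hi - lo
        return out
    ```

    Because both pointers only ever move forward within a partition, each date is visited a constant number of times per sweep.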

    At which point you might as well explode the data and do equi-joins in raw SQL, which for a ±90 day window is likely to still be rather fast on massive data, compared to self-joins.

    Thus for massive data, this should perform much better:

    WITH fake_data(id, DATE) as (
        SELECT * FROM VALUES
        -- this id has visited once
        (1, '2022-04-14'::date),
        -- this id has visited 3 times
        (3, '2022-01-13'::date),
        (3, '2022-03-13'::date),
        (3, '2022-05-13'::date),
        -- this id is a huge visitor
        (5, '2022-01-01'::date),
        (5, '2022-02-01'::date),
        (5, '2022-05-01'::date),
        (5, '2022-06-01'::date),
        (5, '2022-08-01'::date)
    ), range as (
        select row_number() over (order by null)-91 as rn
        from table(generator(ROWCOUNT => 181))
    ), exploded as (
        select
            id, 
            dateadd('day', e.rn, d.date) as t_date
        from fake_data as d
        cross join range as e
    )
    select
        f.*
        ,count(t_date) as c
    from fake_data as f
    join exploded as e
        on f.id = e.id and f.date = t_date
    group by f.id, f.date
    order by f.id, f.DATE
    ;
    
    ID  DATE        C
    1   2022-04-14  1
    3   2022-01-13  2
    3   2022-03-13  3
    3   2022-05-13  2
    5   2022-01-01  2
    5   2022-02-01  3
    5   2022-05-01  3
    5   2022-06-01  3
    5   2022-08-01  2
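    Those figures can be sanity-checked by mimicking the explode-and-equi-join in plain Python (a standalone sketch of mine, not Snowflake code): expand each visit into all 181 (id, date) keys it can match, then look each visit's own key up in that expansion.

    ```python
    from collections import Counter
    from datetime import date, timedelta

    # Sample visits from the question: (id, visit date).
    visits = [
        (1, date(2022, 4, 14)),
        (3, date(2022, 1, 13)), (3, date(2022, 3, 13)), (3, date(2022, 5, 13)),
        (5, date(2022, 1, 1)), (5, date(2022, 2, 1)), (5, date(2022, 5, 1)),
        (5, date(2022, 6, 1)), (5, date(2022, 8, 1)),
    ]

    # The "exploded" CTE: each visit contributes 181 (id, date) keys,
    # one per day offset in [-90, 90].
    exploded = Counter(
        (i, d + timedelta(days=rn))
        for i, d in visits
        for rn in range(-90, 91)
    )

    # The final join/group-by: count exploded rows whose key equals the visit.
    for i, d in visits:
        print(i, d, exploded[(i, d)])
    ```

    The equi-join on exact (id, date) keys is what lets the engine avoid a range-predicate self-join, at the cost of materializing 181 rows per visit.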