sqlsnowflake-cloud-data-platformrankdense-rank

Assign ranking to IPs based on frequency, within rolling windows


Within each ID, I'm trying to rank each IP based on how often they show up within the last 3 months. For example, in the table below, the IP 98.28.88.148 should have a ranking of 1 because they showed up the most within the last 3 months. IP 166.194.154.19 should have rank of 2, and IP 64.227.19.120 should have rank of 3 since it's least frequent. If 2 IPs show up the same number of times, they should be assigned the same Rank.

ID Login_TS_Three_Months_Prior Login_TS IP Count_of_Logins_Three_Months_Prior
1234 2023-08-04T13:40:08.512Z 2023-11-02T13:40:08.512Z 64.227.19.120 0
1234 2023-08-04T14:35:46.717Z 2023-11-02T14:35:46.717Z 64.227.19.120 1
1234 2023-08-06T00:56:25.895Z 2023-11-04T00:56:25.895Z 98.28.88.148 0
1234 2023-08-10T15:50:21.845Z 2023-11-08T15:50:21.845Z 98.28.88.148 1
1234 2023-08-11T17:07:28.623Z 2023-11-09T17:07:28.623Z 98.28.88.148 2
1234 2023-08-12T16:52:18.59Z 2023-11-10T16:52:18.59Z 98.28.88.148 3
1234 2023-08-18T23:17:12.471Z 2023-11-16T23:17:12.471Z 166.194.154.19 0
1234 2023-08-22T20:42:57.979Z 2023-11-20T20:42:57.979Z 166.194.154.19 1
1234 2023-08-25T22:36:52.638Z 2023-11-23T22:36:52.638Z 166.194.154.19 2

The ideal output should look like this:

ID Login_TS_Three_Months_Prior Login_TS IP Count_of_Logins_Three_Months_Prior IP_Ranking
1234 2023-08-04T13:40:08.512Z 2023-11-02T13:40:08.512Z 64.227.19.120 0 3
1234 2023-08-04T14:35:46.717Z 2023-11-02T14:35:46.717Z 64.227.19.120 1 3
1234 2023-08-06T00:56:25.895Z 2023-11-04T00:56:25.895Z 98.28.88.148 0 1
1234 2023-08-10T15:50:21.845Z 2023-11-08T15:50:21.845Z 98.28.88.148 1 1
1234 2023-08-11T17:07:28.623Z 2023-11-09T17:07:28.623Z 98.28.88.148 2 1
1234 2023-08-12T16:52:18.59Z 2023-11-10T16:52:18.59Z 98.28.88.148 3 1
1234 2023-08-18T23:17:12.471Z 2023-11-16T23:17:12.471Z 166.194.154.19 0 2
1234 2023-08-22T20:42:57.979Z 2023-11-20T20:42:57.979Z 166.194.154.19 1 2
1234 2023-08-25T22:36:52.638Z 2023-11-23T22:36:52.638Z 166.194.154.19 2 2

Solution

  • if we get your data via a CTE, only keeping the id, timestamp, & ip address (and adding second batch with a draw in it):

    with question_data as (
        select 
            $1 as id
            ,try_to_timestamp_ntz($2, 'yyyy-mm-dd"T"hh24:mi:ss.ff"Z"') as login_ts
            ,$3 as ip
        from values
        (1234, '2023-11-02T13:40:08.512Z', '64.227.19.120'),
        (1234, '2023-11-02T14:35:46.717Z', '64.227.19.120'),
        (1234, '2023-11-04T00:56:25.895Z', '98.28.88.148'),
        (1234, '2023-11-08T15:50:21.845Z', '98.28.88.148'),
        (1234, '2023-11-09T17:07:28.623Z', '98.28.88.148'),
        (1234, '2023-11-10T16:52:18.59Z', '98.28.88.148'),
        (1234, '2023-11-16T23:17:12.471Z', '166.194.154.19'),
        (1234, '2023-11-20T20:42:57.979Z', '166.194.154.19'),
        (1234, '2023-11-23T22:36:52.638Z', '166.194.154.19'),
    
        (1235, '2023-11-02T13:40:08.512Z', '64.227.19.120'),
        (1235, '2023-11-02T14:35:46.717Z', '64.227.19.120'),
        (1235, '2023-11-04T00:56:25.895Z', '98.28.88.148'),
        (1235, '2023-11-08T15:50:21.845Z', '98.28.88.148'),
        (1235, '2023-11-09T17:07:28.623Z', '98.28.88.148'),
        (1235, '2023-11-10T16:52:18.59Z', '98.28.88.148'),
        (1235, '2023-11-20T20:42:57.979Z', '166.194.154.19'),
        (1235, '2023-11-23T22:36:52.638Z', '166.194.154.19')
    )
    

    we first want to do the count (well really we might want to do the filter, but I assume you have this bit sorted, so we will can count). Using a window function form of COUNT:

        select *, 
            count(*)over(partition by id, ip) as ip_count
        from question_data
    

    those id/ip batches counted.

    enter image description here

    now for ranking there are three forms, ROW_NUMBER, DENSE_RANK and RANK. The first gives every item it's own number, the second gives duplicates equal numbers, and increases by 1, the third gives equals the same, but the 100th RANK item will have only 100 values prior to it (this is like the Olympic's a draw for first gives 1,1,3), it sounds like what you want is the second.

    Now given we are going to run this in a window function fashion again, it needs the prior window function for the count, to be at a different query layer (a sub-select), thus some code show the second/third options:

    select d.*
        ,dense_rank() over(partition by d.id order by ip_count desc) as ip_ranking_d
        ,rank() over(partition by d.id order by ip_count desc) as ip_ranking_s
    from (
        select *, 
            count(*)over(partition by id, ip) as ip_count
        from question_data
    ) as d
    order by d.id, d.ip, d.login_ts;
    

    enter image description here

    here we see in the 1235 id batch, the two IP's with a count of 2 are 2nd equal. Now the question becomes what do you want to happen to the next count, if you want that to be 3, this method works just fine. But if you want the next number after duplicates to be 4, we need a method of allocating sparse values, and join those results.

    so if the above code is what you want, then the clean form looks like:

    select d.* exclude(ip_count)
        ,dense_rank() over(partition by d.id order by ip_count desc) as ip_ranking_d
    from (
        select *, 
            count(*)over(partition by id, ip) as ip_count
        from question_data
    ) as d
    

    but if we want the shared count, but correctly counted gaps:

    select *
    from question_data
    natural join (
        select 
            id,
            ip,
            rank() over (partition by id order by ip_count desc) as ip_rank
        from (
            select         
                id,
                ip,
                count(*) as ip_count
            from question_data
            group by 1,2
        )
    )
    order by 1,2,3;
    

    so with an extra row of data:

        (1235, '2023-11-23T22:36:52.638Z', '166.194.154.18')
    

    the dense form allocation 3:

    enter image description here

    where-as the sparse version with the join allocates 4:

    enter image description here

    with more time consideration:

    so the last 3 months of data per row starts with this:

    with question_data as (
        select 
            $1 as id
            ,try_to_timestamp_ntz($2, 'yyyy-mm-dd"T"hh24:mi:ss.ff"Z"') as login_ts
            ,$3 as ip
        from values
        (1234, '2023-11-02T00:00:00.000Z', 'a'),
        (1234, '2023-10-02T00:00:00.000Z', 'a'),
        (1234, '2023-09-02T00:00:00.000Z', 'a'),
        (1234, '2023-08-02T00:00:00.000Z', 'a'),
    
        (1234, '2023-11-16T00:00:00.000Z', 'b'),
        (1234, '2023-10-16T00:00:00.000Z', 'b'),
        (1234, '2023-09-16T00:00:00.000Z', 'b'),
        (1234, '2023-08-16T00:00:00.000Z', 'b')
    ), enrich as (
        select *, seq8() as s
        from question_data
    )
    select a.*
        ,b.ip as b_ip
        ,b.s as b_s
        ,b.login_ts as b_login_ts
    from enrich as a
    left join enrich as b
        on a.id = b.id 
        and b.login_ts between dateadd('month', -3, a.login_ts) and a.login_ts
    order by 4,6;
    

    enter image description here

    then we want count these rows:

    ), three_months as (
        select a.*
            ,b.ip as b_ip
            ,b.s as b_s
            ,b.login_ts as b_login_ts
        from enrich as a
        left join enrich as b
            on a.id = b.id 
            and b.login_ts between dateadd('month', -3, a.login_ts) and a.login_ts
    )
    select id, 
        login_ts, 
        ip, 
        s, -- does not seem to be needed anymore.. not sure where I was going with that
        b_ip, 
        count(*) as b_ip_count
    from three_months
    group by 1,2,3,4,5
    

    then we want to rank those:

    ), counted as (
        select id, 
            login_ts, 
            ip, 
            s, -- s is used for the ranking partition..
            b_ip, 
            count(*) as b_ip_count
        from three_months
        group by 1,2,3,4,5
    )
    select *
        ,rank() over (partition by id, s order by b_ip_count desc) as sparse_rank
        ,dense_rank() over (partition by id, s order by b_ip_count desc) as dense_rank   
    from counted
    order by 1,4,5;
    

    enter image description here

    now we want to throw away the lines, where ip <> b_ip as those are the other ranked lines:

    thus we now have the three month window from this row, ranked:

    with question_data as (
        select 
            $1 as id
            ,try_to_timestamp_ntz($2, 'yyyy-mm-dd"T"hh24:mi:ss.ff"Z"') as login_ts
            ,$3 as ip
        from values
        (1234, '2023-11-02T00:00:00.000Z', 'a'),
        (1234, '2023-10-02T00:00:00.000Z', 'a'),
        (1234, '2023-09-02T00:00:00.000Z', 'a'),
        (1234, '2023-08-02T00:00:00.000Z', 'a'),
    
        (1234, '2023-11-16T00:00:00.000Z', 'b'),
        (1234, '2023-10-16T00:00:00.000Z', 'b'),
        (1234, '2023-09-16T00:00:00.000Z', 'b'),
        (1234, '2023-08-16T00:00:00.000Z', 'b')
    ), enrich as (
        select *, seq8() as s
        from question_data
    ), three_months as (
        select a.*
            ,b.ip as b_ip
            ,b.s as b_s
            ,b.login_ts as b_login_ts
        from enrich as a
        left join enrich as b
            on a.id = b.id 
            and b.login_ts between dateadd('month', -3, a.login_ts) and a.login_ts
    ), counted as (
        select id, 
            login_ts, 
            ip, 
            s, -- s is used for the ranking partition..
            b_ip, 
            count(*) as b_ip_count
        from three_months
        group by 1,2,3,4,5
    )
    select c.* exclude (s, b_ip, b_ip_count)
        ,rank() over (partition by id, s order by b_ip_count desc) as sparse_rank
        ,dense_rank() over (partition by id, s order by b_ip_count desc) as dense_rank   
    from counted as c
    qualify ip = b_ip
    order by 1,4,5;
    

    giving:

    enter image description here

    Now, clearly for your data, it will not all be ones, and you still need to pick dense or sparse. But this code seem to be what you are wanting.