Within each ID, I'm trying to rank each IP based on how often they show up within the last 3 months. For example, in the table below, the IP 98.28.88.148 should have a ranking of 1 because they showed up the most within the last 3 months. IP 166.194.154.19 should have rank of 2, and IP 64.227.19.120 should have rank of 3 since it's least frequent. If 2 IPs show up the same number of times, they should be assigned the same Rank.
ID | Login_TS_Three_Months_Prior | Login_TS | IP | Count_of_Logins_Three_Months_Prior |
---|---|---|---|---|
1234 | 2023-08-04T13:40:08.512Z | 2023-11-02T13:40:08.512Z | 64.227.19.120 | 0 |
1234 | 2023-08-04T14:35:46.717Z | 2023-11-02T14:35:46.717Z | 64.227.19.120 | 1 |
1234 | 2023-08-06T00:56:25.895Z | 2023-11-04T00:56:25.895Z | 98.28.88.148 | 0 |
1234 | 2023-08-10T15:50:21.845Z | 2023-11-08T15:50:21.845Z | 98.28.88.148 | 1 |
1234 | 2023-08-11T17:07:28.623Z | 2023-11-09T17:07:28.623Z | 98.28.88.148 | 2 |
1234 | 2023-08-12T16:52:18.59Z | 2023-11-10T16:52:18.59Z | 98.28.88.148 | 3 |
1234 | 2023-08-18T23:17:12.471Z | 2023-11-16T23:17:12.471Z | 166.194.154.19 | 0 |
1234 | 2023-08-22T20:42:57.979Z | 2023-11-20T20:42:57.979Z | 166.194.154.19 | 1 |
1234 | 2023-08-25T22:36:52.638Z | 2023-11-23T22:36:52.638Z | 166.194.154.19 | 2 |
The ideal output should look like this:
ID | Login_TS_Three_Months_Prior | Login_TS | IP | Count_of_Logins_Three_Months_Prior | IP_Ranking |
---|---|---|---|---|---|
1234 | 2023-08-04T13:40:08.512Z | 2023-11-02T13:40:08.512Z | 64.227.19.120 | 0 | 3 |
1234 | 2023-08-04T14:35:46.717Z | 2023-11-02T14:35:46.717Z | 64.227.19.120 | 1 | 3 |
1234 | 2023-08-06T00:56:25.895Z | 2023-11-04T00:56:25.895Z | 98.28.88.148 | 0 | 1 |
1234 | 2023-08-10T15:50:21.845Z | 2023-11-08T15:50:21.845Z | 98.28.88.148 | 1 | 1 |
1234 | 2023-08-11T17:07:28.623Z | 2023-11-09T17:07:28.623Z | 98.28.88.148 | 2 | 1 |
1234 | 2023-08-12T16:52:18.59Z | 2023-11-10T16:52:18.59Z | 98.28.88.148 | 3 | 1 |
1234 | 2023-08-18T23:17:12.471Z | 2023-11-16T23:17:12.471Z | 166.194.154.19 | 0 | 2 |
1234 | 2023-08-22T20:42:57.979Z | 2023-11-20T20:42:57.979Z | 166.194.154.19 | 1 | 2 |
1234 | 2023-08-25T22:36:52.638Z | 2023-11-23T22:36:52.638Z | 166.194.154.19 | 2 | 2 |
if we get your data via a CTE, only keeping the id, timestamp, & ip address (and adding second batch with a draw in it):
with question_data as (
select
$1 as id
,try_to_timestamp_ntz($2, 'yyyy-mm-dd"T"hh24:mi:ss.ff"Z"') as login_ts
,$3 as ip
from values
(1234, '2023-11-02T13:40:08.512Z', '64.227.19.120'),
(1234, '2023-11-02T14:35:46.717Z', '64.227.19.120'),
(1234, '2023-11-04T00:56:25.895Z', '98.28.88.148'),
(1234, '2023-11-08T15:50:21.845Z', '98.28.88.148'),
(1234, '2023-11-09T17:07:28.623Z', '98.28.88.148'),
(1234, '2023-11-10T16:52:18.59Z', '98.28.88.148'),
(1234, '2023-11-16T23:17:12.471Z', '166.194.154.19'),
(1234, '2023-11-20T20:42:57.979Z', '166.194.154.19'),
(1234, '2023-11-23T22:36:52.638Z', '166.194.154.19'),
(1235, '2023-11-02T13:40:08.512Z', '64.227.19.120'),
(1235, '2023-11-02T14:35:46.717Z', '64.227.19.120'),
(1235, '2023-11-04T00:56:25.895Z', '98.28.88.148'),
(1235, '2023-11-08T15:50:21.845Z', '98.28.88.148'),
(1235, '2023-11-09T17:07:28.623Z', '98.28.88.148'),
(1235, '2023-11-10T16:52:18.59Z', '98.28.88.148'),
(1235, '2023-11-20T20:42:57.979Z', '166.194.154.19'),
(1235, '2023-11-23T22:36:52.638Z', '166.194.154.19')
)
we first want to do the count (well really we might want to do the filter, but I assume you have this bit sorted, so we will can count). Using a window function form of COUNT:
select *,
count(*)over(partition by id, ip) as ip_count
from question_data
those id/ip batches counted.
now for ranking there are three forms, ROW_NUMBER, DENSE_RANK and RANK. The first gives every item it's own number, the second gives duplicates equal numbers, and increases by 1, the third gives equals the same, but the 100th RANK item will have only 100 values prior to it (this is like the Olympic's a draw for first gives 1,1,3), it sounds like what you want is the second.
Now given we are going to run this in a window function fashion again, it needs the prior window function for the count, to be at a different query layer (a sub-select), thus some code show the second/third options:
select d.*
,dense_rank() over(partition by d.id order by ip_count desc) as ip_ranking_d
,rank() over(partition by d.id order by ip_count desc) as ip_ranking_s
from (
select *,
count(*)over(partition by id, ip) as ip_count
from question_data
) as d
order by d.id, d.ip, d.login_ts;
here we see in the 1235 id batch, the two IP's with a count of 2 are 2nd equal. Now the question becomes what do you want to happen to the next count, if you want that to be 3, this method works just fine. But if you want the next number after duplicates to be 4, we need a method of allocating sparse values, and join those results.
so if the above code is what you want, then the clean form looks like:
select d.* exclude(ip_count)
,dense_rank() over(partition by d.id order by ip_count desc) as ip_ranking_d
from (
select *,
count(*)over(partition by id, ip) as ip_count
from question_data
) as d
but if we want the shared count, but correctly counted gaps:
select *
from question_data
natural join (
select
id,
ip,
rank() over (partition by id order by ip_count desc) as ip_rank
from (
select
id,
ip,
count(*) as ip_count
from question_data
group by 1,2
)
)
order by 1,2,3;
so with an extra row of data:
(1235, '2023-11-23T22:36:52.638Z', '166.194.154.18')
the dense form allocation 3:
where-as the sparse version with the join allocates 4:
so the last 3 months of data per row starts with this:
with question_data as (
select
$1 as id
,try_to_timestamp_ntz($2, 'yyyy-mm-dd"T"hh24:mi:ss.ff"Z"') as login_ts
,$3 as ip
from values
(1234, '2023-11-02T00:00:00.000Z', 'a'),
(1234, '2023-10-02T00:00:00.000Z', 'a'),
(1234, '2023-09-02T00:00:00.000Z', 'a'),
(1234, '2023-08-02T00:00:00.000Z', 'a'),
(1234, '2023-11-16T00:00:00.000Z', 'b'),
(1234, '2023-10-16T00:00:00.000Z', 'b'),
(1234, '2023-09-16T00:00:00.000Z', 'b'),
(1234, '2023-08-16T00:00:00.000Z', 'b')
), enrich as (
select *, seq8() as s
from question_data
)
select a.*
,b.ip as b_ip
,b.s as b_s
,b.login_ts as b_login_ts
from enrich as a
left join enrich as b
on a.id = b.id
and b.login_ts between dateadd('month', -3, a.login_ts) and a.login_ts
order by 4,6;
then we want count these rows:
), three_months as (
select a.*
,b.ip as b_ip
,b.s as b_s
,b.login_ts as b_login_ts
from enrich as a
left join enrich as b
on a.id = b.id
and b.login_ts between dateadd('month', -3, a.login_ts) and a.login_ts
)
select id,
login_ts,
ip,
s, -- does not seem to be needed anymore.. not sure where I was going with that
b_ip,
count(*) as b_ip_count
from three_months
group by 1,2,3,4,5
then we want to rank those:
), counted as (
select id,
login_ts,
ip,
s, -- s is used for the ranking partition..
b_ip,
count(*) as b_ip_count
from three_months
group by 1,2,3,4,5
)
select *
,rank() over (partition by id, s order by b_ip_count desc) as sparse_rank
,dense_rank() over (partition by id, s order by b_ip_count desc) as dense_rank
from counted
order by 1,4,5;
now we want to throw away the lines, where ip <> b_ip as those are the other ranked lines:
thus we now have the three month window from this row, ranked:
with question_data as (
select
$1 as id
,try_to_timestamp_ntz($2, 'yyyy-mm-dd"T"hh24:mi:ss.ff"Z"') as login_ts
,$3 as ip
from values
(1234, '2023-11-02T00:00:00.000Z', 'a'),
(1234, '2023-10-02T00:00:00.000Z', 'a'),
(1234, '2023-09-02T00:00:00.000Z', 'a'),
(1234, '2023-08-02T00:00:00.000Z', 'a'),
(1234, '2023-11-16T00:00:00.000Z', 'b'),
(1234, '2023-10-16T00:00:00.000Z', 'b'),
(1234, '2023-09-16T00:00:00.000Z', 'b'),
(1234, '2023-08-16T00:00:00.000Z', 'b')
), enrich as (
select *, seq8() as s
from question_data
), three_months as (
select a.*
,b.ip as b_ip
,b.s as b_s
,b.login_ts as b_login_ts
from enrich as a
left join enrich as b
on a.id = b.id
and b.login_ts between dateadd('month', -3, a.login_ts) and a.login_ts
), counted as (
select id,
login_ts,
ip,
s, -- s is used for the ranking partition..
b_ip,
count(*) as b_ip_count
from three_months
group by 1,2,3,4,5
)
select c.* exclude (s, b_ip, b_ip_count)
,rank() over (partition by id, s order by b_ip_count desc) as sparse_rank
,dense_rank() over (partition by id, s order by b_ip_count desc) as dense_rank
from counted as c
qualify ip = b_ip
order by 1,4,5;
giving:
Now, clearly for your data, it will not all be ones, and you still need to pick dense or sparse. But this code seem to be what you are wanting.