sqlcountsnowflake-cloud-data-platformwindow-functionssnowsql

Bucketing Data, If Bucket Size Greater Then 1 Set Value To 1 Else 0


I'm using snowflake to attempt to bucket rows based on two columns and if that bucket size is greater then 1 then I need the value of of some new column to be 1, otherwise I would set it to 0. It's similar to NTILE() except that I would need this to be dynamic.

The table that I have looks like the following:

ARTICLE_ID      USER_ID     COUNTRY_NAME    NUM_COUNTRIES   NUM_AUTHORS
------------------------------------------------------------------------
30522143        ecf330d     Singapore       1               4
30522143        86b0faa     Singapore       1               4
30522143        d331b33     Singapore       1               4
30522143        d331b33     Singapore       1               4
------------------------------------------------------------------------
30522139        1d08cfd     United States   2               3
30522139        ec9c065     Canada          2               3
30522139        a750bff     United States   2               3
------------------------------------------------------------------------
23410385        1d08cfd     United States   3               3
23410385        ec9c065     Mexico          3               3
23410385        a41b19c     France          3               3
------------------------------------------------------------------------
30285132        d331b33     Chile           2               4
30285132        1d08cf2     Peru            2               4
30285132        a750bff     Chile           2               4
30285132        d742bb5     Peru            2               4
------------------------------------------------------------------------
...

The two columns that I would need to bucket are by ARTICLE_ID then COUNTRY_NAME. Looking at the above table if we take ARTICLE_ID 30522143 we would have 1 bucket of 4 USER_ID since they are all Singapore then the value of HAS_REPEATED_COUNTRY would be 1.

In the case of ARTICLE_ID 30522139 we would have two sub-buckets one for United States containing 2 USER_IDs and another sub-bucket (Canada) containing 1 USER_ID which would result in all those users associated with United States to have a value of 1 and the sole member associated with Canada to have a value of 0 in HAS_REPEATED_COUNTRY.

Hence the table would resemble:

ARTICLE_ID      USER_ID     COUNTRY_NAME    NUM_COUNTRIES   NUM_AUTHORS     HAS_REPEATED_COUNTRY
------------------------------------------------------------------------------------------------
30522143        ecf330d     Singapore       1               4               1
30522143        86b0faa     Singapore       1               4               1
30522143        d331b33     Singapore       1               4               1
30522143        d331b33     Singapore       1               4               1
------------------------------------------------------------------------------------------------
30522139        1d08cfd     United States   2               3               1
30522139        ec9c065     Canada          2               3               0
30522139        a750bff     United States   2               3               1
------------------------------------------------------------------------------------------------
23410385        1d08cfd     United States   3               3               0
23410385        ec9c065     Mexico          3               3               0
23410385        a41b19c     France          3               3               0
------------------------------------------------------------------------------------------------
30285132        d331b33     Chile           2               4               1
30285132        1d08cf2     Peru            2               4               1
30285132        a750bff     Chile           2               4               1
30285132        d742bb5     Peru            2               4               1
------------------------------------------------------------------------------------------------
...

Is there any way of producing the latter table from the former?


Solution

  • It seems like you want to flag countries that appear more than once for a given article. If so, you can just use a window count:

    select 
        t.*,
        case when count(*) over(partition by article_id, country_name) > 1
            then 1 else 0
        end as has_repeated_country
    from mytable t