pythonpandasgroup-bylookupmembership

Identify records that are present in a particular set and not in another set


I am trying to label rows based on ID and year, If ID is appeared in the years [2017,2018,2019] and not appeared in [2020,2021,2022] then need to label it as 1, else 0.

df1 = pd.DataFrame({'ID': ['AX1', 'AX1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]})

     ID  year
0   AX1  2017
1   AX1  2018
2   AX1  2019
3   AX1  2020
4   AX1  2021
5   AX1  2022
6   AX2  2019
7   AX2  2020
8   AX2  2022
9   AX3  2019
10  AX3  2020
11  AX4  2017
12  AX4  2018
13  AX4  2019

Expected output:

     ID  year  label
0   AX1  2017      0
1   Ax1  2018      0
2   AX1  2019      0
3   AX1  2020      0
4   AX1  2021      0
5   AX1  2022      0
6   AX2  2019      0
7   AX2  2020      0
8   AX2  2022      0
9   AX3  2019      0
10  AX3  2020      0
11  AX4  2017      1
12  AX4  2018      1
13  AX4  2019      1

In the above example ID: AX4 is tagged as 1, because it's the only ID that appears in the first set of years [2017,2018,2019] and not appear in the second set [2020,2021,2022].

How do I achieve this?


Solution

  • Group by ID, check if the wanted years are in and unwanted years are out using set operations and map the result back to df1.

    df1 = pd.DataFrame({
        'ID': ['AX1', 'AX1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],
        'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]
    })
    
    # find group level labels by checking if all of 2017-19 and none of 2020-22 exist for each ID
    labels = (
        df1.groupby('ID')['year']
        .apply(lambda g: {2017,2018,2019}.issubset(g) and not {2020,2021,2022}.intersection(g))
        .astype(int)
    )
    # map group level labels to ID
    df1['labels'] = df1['ID'].map(labels)
    

    Another (a bit more readable code) is to cross-tabulate df1 and check the years across columns. pd.crosstab() sorts the columns (in this case, years), so simple eq() works.

    # cross tabulate and check for years across columns
    labels = (
        pd.crosstab(df1['ID'], df1['year'])
        .eq([1, 1, 1, 0, 0, 0], axis=1)
        .all(axis=1)
        .astype(int)
    )
    # map group level labels to ID
    df1['labels'] = df1['ID'].map(labels)
    df1
    

    Yet another method to get the lookup labels is to make the membership test first on each year in the 'year' column, then group by 'ID'. The aggregator method should be all() because we want the all years of each ID to satisfy the membership test.

    labels = (
        df1.eval('year in [2017, 2018, 2019] and year not in [2020,2021,2022]')
        .groupby(df1['ID'])
        .all()
        .astype(int)
    )
    df1['labels'] = df1['ID'].map(labels)
    

    enter image description here