I am trying to label rows based on ID and year, If ID is appeared in the years [2017,2018,2019]
and not appeared in [2020,2021,2022]
then need to label it as 1, else 0.
df1 = pd.DataFrame({'ID': ['AX1', 'AX1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]})
ID year
0 AX1 2017
1 AX1 2018
2 AX1 2019
3 AX1 2020
4 AX1 2021
5 AX1 2022
6 AX2 2019
7 AX2 2020
8 AX2 2022
9 AX3 2019
10 AX3 2020
11 AX4 2017
12 AX4 2018
13 AX4 2019
Expected output:
ID year label
0 AX1 2017 0
1 Ax1 2018 0
2 AX1 2019 0
3 AX1 2020 0
4 AX1 2021 0
5 AX1 2022 0
6 AX2 2019 0
7 AX2 2020 0
8 AX2 2022 0
9 AX3 2019 0
10 AX3 2020 0
11 AX4 2017 1
12 AX4 2018 1
13 AX4 2019 1
In the above example ID: AX4 is tagged as 1, because it's the only ID that appears in the first set of years [2017,2018,2019]
and not appear in the second set [2020,2021,2022]
.
How do I achieve this?
Group by ID, check if the wanted years are in and unwanted years are out using set operations and map the result back to df1
.
df1 = pd.DataFrame({
'ID': ['AX1', 'AX1', 'AX1','AX1','AX1','AX1','AX2','AX2','AX2','AX3','AX3','AX4','AX4','AX4'],
'year':[2017,2018,2019,2020,2021,2022,2019,2020,2022,2019,2020,2017,2018,2019]
})
# find group level labels by checking if all of 2017-19 and none of 2020-22 exist for each ID
labels = (
df1.groupby('ID')['year']
.apply(lambda g: {2017,2018,2019}.issubset(g) and not {2020,2021,2022}.intersection(g))
.astype(int)
)
# map group level labels to ID
df1['labels'] = df1['ID'].map(labels)
Another (a bit more readable code) is to cross-tabulate df1
and check the years across columns. pd.crosstab()
sorts the columns (in this case, years), so simple eq()
works.
# cross tabulate and check for years across columns
labels = (
pd.crosstab(df1['ID'], df1['year'])
.eq([1, 1, 1, 0, 0, 0], axis=1)
.all(axis=1)
.astype(int)
)
# map group level labels to ID
df1['labels'] = df1['ID'].map(labels)
df1
Yet another method to get the lookup labels is to make the membership test first on each year in the 'year'
column, then group by 'ID'
. The aggregator method should be all()
because we want the all years of each ID to satisfy the membership test.
labels = (
df1.eval('year in [2017, 2018, 2019] and year not in [2020,2021,2022]')
.groupby(df1['ID'])
.all()
.astype(int)
)
df1['labels'] = df1['ID'].map(labels)