I have dataframe containing list of subjects + dates of dispensing, one subject has more Dates of Dispensing and one single Date of dispensing for one subject can occur several times. Here is example:
{'Subject': {1449: 'CZ100030006',
1786: 'CZ100030006',
1958: 'CZ100030006',
1964: 'CZ100030006',
4067: 'CZ100030006',
4119: 'CZ100030006',
4143: 'CZ100030006',
4441: 'CZ100030006',
4467: 'CZ100030006',
4530: 'CZ100030006',
4532: 'CZ100030006',
4585: 'CZ100030006',
4703: 'CZ100030006',
4767: 'CZ100030006',
4850: 'CZ100030006',
4888: 'CZ100030006',
4974: 'CZ100030006',
4987: 'CZ100030006',
5108: 'CZ100030006',
5476: 'CZ100030006',
9768: 'CZ100030005',
9815: 'CZ100030005',
9822: 'CZ100030005',
9837: 'CZ100030005',
9852: 'CZ100030005',
9853: 'CZ100030005',
9889: 'CZ100030005',
9945: 'CZ100030005',
10009: 'CZ100030005',
10050: 'CZ100030005',
10052: 'CZ100030005',
10060: 'CZ100030005',
11532: 'CZ100030005',
11582: 'CZ100030005',
11640: 'CZ100030005',
11722: 'CZ100030005',
13267: 'CZ100030005',
13339: 'CZ100030005',
13354: 'CZ100030005',
13655: 'CZ100030005'},
'Date Dispensed': {1449: datetime.date(2024, 7, 4),
1786: datetime.date(2024, 7, 4),
1958: datetime.date(2024, 6, 21),
1964: datetime.date(2024, 6, 21),
4067: datetime.date(2024, 9, 16),
4119: datetime.date(2024, 9, 16),
4143: datetime.date(2024, 7, 19),
4441: datetime.date(2024, 7, 19),
4467: datetime.date(2024, 7, 19),
4530: datetime.date(2024, 7, 19),
4532: datetime.date(2024, 9, 16),
4585: datetime.date(2024, 7, 19),
4703: datetime.date(2024, 10, 11),
4767: datetime.date(2024, 7, 19),
4850: datetime.date(2024, 7, 19),
4888: datetime.date(2024, 7, 19),
4974: datetime.date(2024, 10, 11),
4987: datetime.date(2024, 9, 16),
5108: datetime.date(2024, 10, 11),
5476: datetime.date(2024, 10, 11),
9768: datetime.date(2024, 7, 4),
9815: datetime.date(2024, 7, 4),
9822: datetime.date(2024, 8, 28),
9837: datetime.date(2024, 7, 4),
9852: datetime.date(2024, 7, 4),
9853: datetime.date(2024, 7, 4),
9889: datetime.date(2024, 8, 28),
9945: datetime.date(2024, 7, 4),
10009: datetime.date(2024, 7, 4),
10050: datetime.date(2024, 7, 4),
10052: datetime.date(2024, 8, 28),
10060: datetime.date(2024, 8, 28),
11532: datetime.date(2024, 6, 20),
11582: datetime.date(2024, 6, 5),
11640: datetime.date(2024, 6, 20),
11722: datetime.date(2024, 6, 5),
13267: datetime.date(2024, 9, 25),
13339: datetime.date(2024, 9, 25),
13354: datetime.date(2024, 9, 25),
13655: datetime.date(2024, 9, 25)}}
So for Subject CZ100030005, there will be True in added column if in that row Dispensing date is 28AUG2024, because this is 2nd max value of date dispensed.
I am able to find max value per group maxima = df_cov.groupby('Subject')['Date Dispensed'].max()
, but I am not able to find 2nd to max.
And I am not able to do 2nd step at all, i.e. to make new column True/False based on whether 2nd to max value equals/not equals to current row Date Dispensed value.
Can you advice please?
@Dmitry543 has the correct logic, but this should used groupby.transform
and a comparison with itself in the function:
# ensure datetime
df['Date Dispensed'] = pd.to_datetime(df['Date Dispensed'])
# find largest second(s) for each group
df['new'] = (df.groupby('Subject')['Date Dispensed']
.transform(lambda x: x==x.drop_duplicates().nlargest(2).iloc[-1])
)
Output:
Subject Date Dispensed new
1449 CZ100030006 2024-07-04 False
1786 CZ100030006 2024-07-04 False
1958 CZ100030006 2024-06-21 False
1964 CZ100030006 2024-06-21 False
4067 CZ100030006 2024-09-16 True
4119 CZ100030006 2024-09-16 True
4143 CZ100030006 2024-07-19 False
4441 CZ100030006 2024-07-19 False
4467 CZ100030006 2024-07-19 False
4530 CZ100030006 2024-07-19 False
4532 CZ100030006 2024-09-16 True
4585 CZ100030006 2024-07-19 False
4703 CZ100030006 2024-10-11 False
4767 CZ100030006 2024-07-19 False
4850 CZ100030006 2024-07-19 False
4888 CZ100030006 2024-07-19 False
4974 CZ100030006 2024-10-11 False
4987 CZ100030006 2024-09-16 True
5108 CZ100030006 2024-10-11 False
5476 CZ100030006 2024-10-11 False
9768 CZ100030005 2024-07-04 False
9815 CZ100030005 2024-07-04 False
9822 CZ100030005 2024-08-28 True
9837 CZ100030005 2024-07-04 False
9852 CZ100030005 2024-07-04 False
9853 CZ100030005 2024-07-04 False
9889 CZ100030005 2024-08-28 True
9945 CZ100030005 2024-07-04 False
10009 CZ100030005 2024-07-04 False
10050 CZ100030005 2024-07-04 False
10052 CZ100030005 2024-08-28 True
10060 CZ100030005 2024-08-28 True
11532 CZ100030005 2024-06-20 False
11582 CZ100030005 2024-06-05 False
11640 CZ100030005 2024-06-20 False
11722 CZ100030005 2024-06-05 False
13267 CZ100030005 2024-09-25 False
13339 CZ100030005 2024-09-25 False
13354 CZ100030005 2024-09-25 False
13655 CZ100030005 2024-09-25 False
def f(x):
unique = x.drop_duplicates()
print(f'unique dates: {unique.tolist()}')
top_2 = unique.nlargest(2)
print(f'largest two: {top_2.tolist()}')
print(f'equality to second largest ({top_2.iloc[-1]}):')
print(x == top_2.iloc[-1])
return x == top_2.iloc[-1]
(df.groupby('Subject')['Date Dispensed']
.transform(f)
)
Intermediates:
unique dates: [Timestamp('2024-07-04 00:00:00'), Timestamp('2024-08-28 00:00:00'), Timestamp('2024-06-20 00:00:00'), Timestamp('2024-06-05 00:00:00'), Timestamp('2024-09-25 00:00:00')]
largest two: [Timestamp('2024-09-25 00:00:00'), Timestamp('2024-08-28 00:00:00')]
equality to second largest (2024-08-28 00:00:00):
9768 False
9815 False
9822 True
9837 False
9852 False
9853 False
9889 True
9945 False
10009 False
10050 False
10052 True
10060 True
11532 False
11582 False
11640 False
11722 False
13267 False
13339 False
13354 False
13655 False
Name: CZ100030005, dtype: bool
unique dates: [Timestamp('2024-07-04 00:00:00'), Timestamp('2024-06-21 00:00:00'), Timestamp('2024-09-16 00:00:00'), Timestamp('2024-07-19 00:00:00'), Timestamp('2024-10-11 00:00:00')]
largest two: [Timestamp('2024-10-11 00:00:00'), Timestamp('2024-09-16 00:00:00')]
equality to second largest (2024-09-16 00:00:00):
1449 False
1786 False
1958 False
1964 False
4067 True
4119 True
4143 False
4441 False
4467 False
4530 False
4532 True
4585 False
4703 False
4767 False
4850 False
4888 False
4974 False
4987 True
5108 False
5476 False
Name: CZ100030006, dtype: bool