pythonpandasdataframegroup-bygrouping

Calculate new column value based on max for group in Pandas Dataframe


I have dataframe containing list of subjects + dates of dispensing, one subject has more Dates of Dispensing and one single Date of dispensing for one subject can occur several times. Here is example:

{'Subject': {1449: 'CZ100030006',
  1786: 'CZ100030006',
  1958: 'CZ100030006',
  1964: 'CZ100030006',
  4067: 'CZ100030006',
  4119: 'CZ100030006',
  4143: 'CZ100030006',
  4441: 'CZ100030006',
  4467: 'CZ100030006',
  4530: 'CZ100030006',
  4532: 'CZ100030006',
  4585: 'CZ100030006',
  4703: 'CZ100030006',
  4767: 'CZ100030006',
  4850: 'CZ100030006',
  4888: 'CZ100030006',
  4974: 'CZ100030006',
  4987: 'CZ100030006',
  5108: 'CZ100030006',
  5476: 'CZ100030006',
  9768: 'CZ100030005',
  9815: 'CZ100030005',
  9822: 'CZ100030005',
  9837: 'CZ100030005',
  9852: 'CZ100030005',
  9853: 'CZ100030005',
  9889: 'CZ100030005',
  9945: 'CZ100030005',
  10009: 'CZ100030005',
  10050: 'CZ100030005',
  10052: 'CZ100030005',
  10060: 'CZ100030005',
  11532: 'CZ100030005',
  11582: 'CZ100030005',
  11640: 'CZ100030005',
  11722: 'CZ100030005',
  13267: 'CZ100030005',
  13339: 'CZ100030005',
  13354: 'CZ100030005',
  13655: 'CZ100030005'},
 'Date Dispensed': {1449: datetime.date(2024, 7, 4),
  1786: datetime.date(2024, 7, 4),
  1958: datetime.date(2024, 6, 21),
  1964: datetime.date(2024, 6, 21),
  4067: datetime.date(2024, 9, 16),
  4119: datetime.date(2024, 9, 16),
  4143: datetime.date(2024, 7, 19),
  4441: datetime.date(2024, 7, 19),
  4467: datetime.date(2024, 7, 19),
  4530: datetime.date(2024, 7, 19),
  4532: datetime.date(2024, 9, 16),
  4585: datetime.date(2024, 7, 19),
  4703: datetime.date(2024, 10, 11),
  4767: datetime.date(2024, 7, 19),
  4850: datetime.date(2024, 7, 19),
  4888: datetime.date(2024, 7, 19),
  4974: datetime.date(2024, 10, 11),
  4987: datetime.date(2024, 9, 16),
  5108: datetime.date(2024, 10, 11),
  5476: datetime.date(2024, 10, 11),
  9768: datetime.date(2024, 7, 4),
  9815: datetime.date(2024, 7, 4),
  9822: datetime.date(2024, 8, 28),
  9837: datetime.date(2024, 7, 4),
  9852: datetime.date(2024, 7, 4),
  9853: datetime.date(2024, 7, 4),
  9889: datetime.date(2024, 8, 28),
  9945: datetime.date(2024, 7, 4),
  10009: datetime.date(2024, 7, 4),
  10050: datetime.date(2024, 7, 4),
  10052: datetime.date(2024, 8, 28),
  10060: datetime.date(2024, 8, 28),
  11532: datetime.date(2024, 6, 20),
  11582: datetime.date(2024, 6, 5),
  11640: datetime.date(2024, 6, 20),
  11722: datetime.date(2024, 6, 5),
  13267: datetime.date(2024, 9, 25),
  13339: datetime.date(2024, 9, 25),
  13354: datetime.date(2024, 9, 25),
  13655: datetime.date(2024, 9, 25)}}

So for Subject CZ100030005, there will be True in added column if in that row Dispensing date is 28AUG2024, because this is 2nd max value of date dispensed.

I am able to find max value per group maxima = df_cov.groupby('Subject')['Date Dispensed'].max(), but I am not able to find 2nd to max.

And I am not able to do 2nd step at all, i.e. to make new column True/False based on whether 2nd to max value equals/not equals to current row Date Dispensed value.

Can you advice please?


Solution

  • @Dmitry543 has the correct logic, but this should used groupby.transform and a comparison with itself in the function:

    # ensure datetime
    df['Date Dispensed'] = pd.to_datetime(df['Date Dispensed'])
    
    # find largest second(s) for each group
    df['new'] = (df.groupby('Subject')['Date Dispensed']
                   .transform(lambda x: x==x.drop_duplicates().nlargest(2).iloc[-1])
                )
    

    Output:

               Subject Date Dispensed    new
    1449   CZ100030006     2024-07-04  False
    1786   CZ100030006     2024-07-04  False
    1958   CZ100030006     2024-06-21  False
    1964   CZ100030006     2024-06-21  False
    4067   CZ100030006     2024-09-16   True
    4119   CZ100030006     2024-09-16   True
    4143   CZ100030006     2024-07-19  False
    4441   CZ100030006     2024-07-19  False
    4467   CZ100030006     2024-07-19  False
    4530   CZ100030006     2024-07-19  False
    4532   CZ100030006     2024-09-16   True
    4585   CZ100030006     2024-07-19  False
    4703   CZ100030006     2024-10-11  False
    4767   CZ100030006     2024-07-19  False
    4850   CZ100030006     2024-07-19  False
    4888   CZ100030006     2024-07-19  False
    4974   CZ100030006     2024-10-11  False
    4987   CZ100030006     2024-09-16   True
    5108   CZ100030006     2024-10-11  False
    5476   CZ100030006     2024-10-11  False
    9768   CZ100030005     2024-07-04  False
    9815   CZ100030005     2024-07-04  False
    9822   CZ100030005     2024-08-28   True
    9837   CZ100030005     2024-07-04  False
    9852   CZ100030005     2024-07-04  False
    9853   CZ100030005     2024-07-04  False
    9889   CZ100030005     2024-08-28   True
    9945   CZ100030005     2024-07-04  False
    10009  CZ100030005     2024-07-04  False
    10050  CZ100030005     2024-07-04  False
    10052  CZ100030005     2024-08-28   True
    10060  CZ100030005     2024-08-28   True
    11532  CZ100030005     2024-06-20  False
    11582  CZ100030005     2024-06-05  False
    11640  CZ100030005     2024-06-20  False
    11722  CZ100030005     2024-06-05  False
    13267  CZ100030005     2024-09-25  False
    13339  CZ100030005     2024-09-25  False
    13354  CZ100030005     2024-09-25  False
    13655  CZ100030005     2024-09-25  False
    

    explaining the logic:

    def f(x):
        unique = x.drop_duplicates()
        print(f'unique dates: {unique.tolist()}')
        top_2 = unique.nlargest(2)
        print(f'largest two: {top_2.tolist()}')
        print(f'equality to second largest ({top_2.iloc[-1]}):')
        print(x == top_2.iloc[-1])
        return x == top_2.iloc[-1]
    
    (df.groupby('Subject')['Date Dispensed']
       .transform(f)
    )
    

    Intermediates:

    unique dates: [Timestamp('2024-07-04 00:00:00'), Timestamp('2024-08-28 00:00:00'), Timestamp('2024-06-20 00:00:00'), Timestamp('2024-06-05 00:00:00'), Timestamp('2024-09-25 00:00:00')]
    largest two: [Timestamp('2024-09-25 00:00:00'), Timestamp('2024-08-28 00:00:00')]
    equality to second largest (2024-08-28 00:00:00):
    9768     False
    9815     False
    9822      True
    9837     False
    9852     False
    9853     False
    9889      True
    9945     False
    10009    False
    10050    False
    10052     True
    10060     True
    11532    False
    11582    False
    11640    False
    11722    False
    13267    False
    13339    False
    13354    False
    13655    False
    Name: CZ100030005, dtype: bool
    unique dates: [Timestamp('2024-07-04 00:00:00'), Timestamp('2024-06-21 00:00:00'), Timestamp('2024-09-16 00:00:00'), Timestamp('2024-07-19 00:00:00'), Timestamp('2024-10-11 00:00:00')]
    largest two: [Timestamp('2024-10-11 00:00:00'), Timestamp('2024-09-16 00:00:00')]
    equality to second largest (2024-09-16 00:00:00):
    1449    False
    1786    False
    1958    False
    1964    False
    4067     True
    4119     True
    4143    False
    4441    False
    4467    False
    4530    False
    4532     True
    4585    False
    4703    False
    4767    False
    4850    False
    4888    False
    4974    False
    4987     True
    5108    False
    5476    False
    Name: CZ100030006, dtype: bool