pythonpandaspandas-groupbypandas-loc

Condition between duplicated values in a column


Every customer is duplicated when they have more than one plan. I want to set the status to the customer:

If they have every product with 'canceled_at' filled, the customer status is cancelled, but when it's not every product with the canceled_at filled, but at least one, the status is 'downgrade' because he lost a product.

customer|canceled_at|status
x       |3/27/2018  |
x       |           |
y       |2/2/2018   |
y       |2/2/2018   |
z       |1/1/2018   |
a       |           |      

I already have the canceled status, now i only need the downgrade

df['status']=(df.groupby('customer')['canceled_at'].
  transform(lambda x: x.notna().all()).map({True:'canceled'})).fillna(df.status)
customer|canceled_at|status
x       |3/27/2018  |downgrade
x       |           |downgrade
y       |2/2/2018   |canceled
y       |2/2/2018   |canceled
z       |1/1/2018   |canceled
a       |           |      

Solution

  • Here is possible compare column for no missing values and grouping by Series customer with GroupBy.transform and GroupBy.all, GroupBy.any for test all values Trues (all non missing) or at least one value not missing (any non missing) and pass it to numpy.select:

    g = df['canceled_at'].notna().groupby(df['customer'])
    m1 = g.transform('all')
    m2 = g.transform('any')
    
    df['status'] = np.select([m1, m2],['canceled','downgrade'], np.nan)
    print (df)
      customer canceled_at     status
    0        x   3/27/2018  downgrade
    1        x         NaN  downgrade
    2        y    2/2/2018   canceled
    3        y    2/2/2018   canceled
    4        z    1/1/2018   canceled
    5        a         NaN        nan
    

    Or:

    df['status'] = np.select([m1, m2],['canceled','downgrade'], '')
    print (df)
      customer canceled_at     status
    0        x   3/27/2018  downgrade
    1        x         NaN  downgrade
    2        y    2/2/2018   canceled
    3        y    2/2/2018   canceled
    4        z    1/1/2018   canceled
    5        a         NaN         
    

    If only NaNs groups need convert to downgrade:

    mask = df['canceled_at'].notna().groupby(df['customer']).transform('all')
    df['status'] = np.where(mask,'canceled','downgrade')
    print (df)
      customer canceled_at     status
    0        x   3/27/2018  downgrade
    1        x         NaN  downgrade
    2        y    2/2/2018   canceled
    3        y    2/2/2018   canceled
    4        z    1/1/2018   canceled
    5        a         NaN  downgrade