pandasnumpyranking-functions

rank for nan values based on group


I have dataframe with column d1 and now i am trying calculate 'out' column after ranking that column when there in 'nan' value with in a column.

  data_input = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
                  'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
             'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan']}  
    
    df_input = pd.DataFrame(data_input)
    
    data_out = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
                  'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
             'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan'],
             'out':[1,np.NaN,np.NaN,np.NaN,2,2,np.NaN,np.NaN,1,1,np.NaN,np.NaN,np.NaN,2]}  
    
    df_out = pd.DataFrame(data_out)

If in that particular group if nan appers before and after some values then rank should be in asscending. ex: rank for index-0 will be 1 and index-4&5 will be 2(because there is no after values in that group)

df_out["out"] = df_out.groupby(["Name","type"])['d1'].rank(method="first")

Solution

  • Use GroupBy.cumsum by consecutive mising values per groups:

    df_out['d1'] = pd.to_numeric(df_out['d1'], errors='coerce')
    
    
    m = df_out['d1'].isna()
    
    df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))
                            .groupby(["Name","type"])['a']
                            .cumsum()
                            .where(m))
    

    Alternative solution with boolean indexing:

    df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))[m]
                            .groupby(["Name","type"])['a']
                            .cumsum())