python, pandas, group-by, unique

number every first unique piece in each group


In each group, the first occurrence of each unique item should be given its number in a new column 'num'.

I can form the groups but I don't know how to number the unique pieces.

Is there a way to do that?

Unique numbers are:

AF=1 / CT=2 / RT=3 / CTS=4

import pandas as pd

data = {'ATEXT': ['AF', 'AF', '', '', 'CT', 'RT', '', 'AF', 'AF', 'CTS', 'AF', 'AF', 'AF', 'CT', 'AF', 'CT', 'AF', 'AF', 'AF', 'AF', 'RT', 'RT', '', '', 'AF', 'CT', 'CT', 'RT', 'AF', 'AF', 'CT']}
df = pd.DataFrame(data)
df

End Result should be:

Out[6]: 
   ATEXT num
0     AF   1
1     AF    
2           
3           
4     CT   2
5     RT   3
6           
7     AF   1
8     AF    
9    CTS   4
10    AF    
11    AF    
12    AF    
13    CT   2
14    AF    
15    CT    
16    AF    
17    AF    
18    AF    
19    AF    
20    RT   3
21    RT    
22          
23          
24    AF   1
25    CT   2
26    CT    
27    RT   3
28    AF    
29    AF    
30    CT
My idea (does not yet give a useful result):

nl = df['ATEXT']!=("")
df['num'] = (df['ATEXT'].mask(nl)
        .groupby(nl)
        .where(~df.ATEXT.duplicated()
        )

Solution

  • IIUC, I'm using a few intermediate columns to help with the logic. Instead of using factorize you could use map to assign your unique numbers (see the sketch after the first output below).

    Try:

    import numpy as np

    # Code each non-empty value by order of first appearance: AF=1, CT=2, RT=3, CTS=4
    df['CODE'] = df['ATEXT'].mask(df['ATEXT'] == '').factorize()[0] + 1
    df.loc[df['ATEXT'] == '', 'CODE'] = np.nan
    # A new group starts at every empty row
    df['grp'] = df['ATEXT'].eq('').cumsum()
    # Keep the CODE only on the first occurrence of each value within a group
    df['Num'] = df.groupby('grp', group_keys=False).apply(lambda x: x.drop_duplicates('ATEXT'))['CODE']
    df['Num'] = df['Num'].fillna('')
    df[['ATEXT', 'Num']]
    

    Output:

       ATEXT  Num
    0     AF  1.0
    1     AF     
    2            
    3            
    4     CT  2.0
    5     RT  3.0
    6            
    7     AF  1.0
    8     AF     
    9    CTS  4.0
    10    AF     
    11    AF     
    12    AF     
    13    CT  2.0
    14    AF     
    15    CT     
    16    AF     
    17    AF     
    18    AF     
    19    AF     
    20    RT  3.0
    21    RT     
    22           
    23           
    24    AF  1.0
    25    CT  2.0
    26    CT     
    27    RT  3.0
    28    AF     
    29    AF     
    30    CT     
    
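    As a sketch of the map-based alternative mentioned above (assuming you want to hard-code the numbers rather than derive them from order of first appearance), the factorize step could be replaced with an explicit mapping; code_map is a hypothetical name, not part of your original data:

    # Hypothetical explicit mapping instead of factorize; empty strings are not
    # in the dict, so map leaves them as NaN, and the remaining steps are unchanged.
    code_map = {'AF': 1, 'CT': 2, 'RT': 3, 'CTS': 4}
    df['CODE'] = df['ATEXT'].map(code_map)
    df['grp'] = df['ATEXT'].eq('').cumsum()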

    Or, much like you were doing, let's use transform with groupby to mask duplicates with where and duplicated in each group:

    # Within each group, keep CODE only where it is not a duplicate of an earlier value
    df['Num_2'] = df.groupby('grp')['CODE'].transform(lambda x: x.where(~x.duplicated(keep='first')).fillna(''))
    df[['ATEXT', 'Num_2']]
    

    Output:

       ATEXT Num_2
    0     AF   1.0
    1     AF      
    2             
    3             
    4     CT   2.0
    5     RT   3.0
    6             
    7     AF   1.0
    8     AF      
    9    CTS   4.0
    10    AF      
    11    AF      
    12    AF      
    13    CT   2.0
    14    AF      
    15    CT      
    16    AF      
    17    AF      
    18    AF      
    19    AF      
    20    RT   3.0
    21    RT      
    22            
    23            
    24    AF   1.0
    25    CT   2.0
    26    CT      
    27    RT   3.0
    28    AF      
    29    AF      
    30    CT
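
    A minimal sketch of the same logic in one pass, assuming you would rather not keep the intermediate CODE and grp helper columns on the frame:

    # Codes by order of first appearance; empty rows become NaN
    codes = pd.Series(df['ATEXT'].mask(df['ATEXT'] == '').factorize()[0] + 1,
                      index=df.index).mask(df['ATEXT'] == '')
    groups = df['ATEXT'].eq('').cumsum()            # new group at every empty row
    # Keep each code only at its first occurrence within a group
    df['num'] = (codes.groupby(groups)
                      .transform(lambda x: x.where(~x.duplicated(keep='first')))
                      .fillna(''))
    df[['ATEXT', 'num']]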