pandas, dataframe, group-by

Quick way to create hundreds of new columns in Pandas dataframe


I have a Pandas dataframe that looks like:

Group_ID   feature1   feature2   label
1          3          2          0
1          5          7          0
1          2          4          1
1          9          9          1
1          2          0          1
2          4          1          1
2          8          8          0
2          5          5          0
3          0          9          1
3          4          7          1
3          2          3          0
3          7          2          0

and for each feature i (i = 1, 2), I would like to create a new column called featurei_rel using the following logic: it is the value of feature i in that row divided by the mean of the two smallest values of feature i within the same group.

So for example, for feature1 in row 1, the two smallest values in Group 1 are 2 and 2 (rows 3 and 5 respectively), hence feature1_rel for row 1 is 3/((2+2)/2) = 3/2 = 1.5. The desired result looks like:

Group_ID   feature1   feature2   feature1_rel   feature2_rel   label
1          3          2          3/((2+2)/2)    2/((0+2)/2)    0
1          5          7          5/((2+2)/2)    7/((0+2)/2)    0
1          2          4          2/((2+2)/2)    4/((0+2)/2)    1
1          9          9          9/((2+2)/2)    9/((0+2)/2)    1
1          2          0          2/((2+2)/2)    0/((0+2)/2)    1
2          4          1          4/((4+5)/2)    1/((1+5)/2)    1
2          8          8          8/((4+5)/2)    8/((1+5)/2)    0
2          5          5          5/((4+5)/2)    5/((1+5)/2)    0
3          0          9          0/((0+2)/2)    9/((2+3)/2)    1
3          4          7          4/((0+2)/2)    7/((2+3)/2)    1
3          2          3          2/((0+2)/2)    3/((2+3)/2)    0
3          7          2          7/((0+2)/2)    2/((2+3)/2)    0
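
For reference, the sample dataframe above can be reconstructed with:

import pandas as pd

df = pd.DataFrame({
    'Group_ID': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'feature1': [3, 5, 2, 9, 2, 4, 8, 5, 0, 4, 2, 7],
    'feature2': [2, 7, 4, 9, 0, 1, 8, 5, 9, 7, 3, 2],
    'label':    [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})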

So here is what I have tried:

# column holding the smallest value of feature1 in each group
df['feature1min'] = df.groupby('Group_ID')['feature1'].transform('min')
# column holding the second smallest value of feature1 in each group
df['feature1min2'] = df.groupby('Group_ID')['feature1'].transform(lambda x: x.nsmallest(2).iloc[-1])

df['feature1_rel'] = df['feature1'] / ((df['feature1min'] + df['feature1min2']) / 2)

However, my actual dataset has hundreds of features and millions of rows, so I was wondering whether there is a fast way to do this. Thank you so much in advance.


Solution

  • IIUC, you can try:

    # for every group, divide each value by the mean of that column's two smallest values
    col = pd.Index(['feature1', 'feature2'])
    df[col + '_rel'] = df.groupby('Group_ID')[col].transform(lambda x: x / x.nsmallest(2).mean())
    

    Output:

        Group_ID  feature1  feature2  label  feature1_rel  feature2_rel
    0          1         3         2      0      1.500000      2.000000
    1          1         5         7      0      2.500000      7.000000
    2          1         2         4      1      1.000000      4.000000
    3          1         9         9      1      4.500000      9.000000
    4          1         2         0      1      1.000000      0.000000
    5          2         4         1      1      0.888889      0.333333
    6          2         8         8      0      1.777778      2.666667
    7          2         5         5      0      1.111111      1.666667
    8          3         0         9      1      0.000000      3.600000
    9          3         4         7      1      4.000000      2.800000
    10         3         2         3      0      2.000000      1.200000
    11         3         7         2      0      7.000000      0.800000
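
    If the per-group Python lambda turns out to be a bottleneck with hundreds of feature columns and millions of rows, one possible alternative (a sketch of my own, not part of the answer above, and assuming cols holds the list of feature column names) is to compute the per-group denominators in a mostly vectorised way via a melt/rank reshape and a single division:

    import pandas as pd

    cols = ['feature1', 'feature2']  # in practice, the full list of feature columns

    # long format: one row per (Group_ID, feature, value)
    long = df.melt(id_vars='Group_ID', value_vars=cols, var_name='feature', value_name='value')

    # rank within each (group, feature); method='first' breaks ties, so exactly
    # two rows per group/feature pair get rank <= 2 even when values repeat
    long['rank'] = long.groupby(['Group_ID', 'feature'])['value'].rank(method='first')

    # mean of the two smallest values, as a Group_ID x feature table of denominators
    denom = (long[long['rank'] <= 2]
             .groupby(['Group_ID', 'feature'])['value']
             .mean()
             .unstack())

    # align the denominators back to the original rows and divide in one shot
    df[[c + '_rel' for c in cols]] = (
        df[cols].to_numpy() / denom.reindex(df['Group_ID'])[cols].to_numpy()
    )

    On the small example above this reproduces the same feature1_rel and feature2_rel values; whether it is actually faster on the real data would need to be benchmarked.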