python-3.xpandasdataframegroup-byclip

Set upperbound in a column for a specific group by using Python


I have a dataset given as such in Python:

#Load the required libraries
import pandas as pd


#Create dataset
data = {'ID': [1, 1, 1, 1, 1,1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'Salary': [1, 2, 3, 4, 5,6,7,8,9,10, 1, 2, 3,4,5,6, 1, 2, 3, 4,5,6,7,8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No','No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141,123, 128, 66, 120, 141, 52,96, 120, 141, 52,96, 120, 141,123,15,85,36,58,89],
        }

#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)

The above dataframe looks as such :

enter image description here

Now, for every ID/group, I wish to set an upperbound for some value of 'Salary'.

For example,

For ID=1, the upperbound of 'Salary' should be set at 4

For ID=2, the upperbound of 'Salary' should be set at 3

For ID=3, the upperbound of 'Salary' should be set at 5

The net result needs to look as such:

enter image description here

Can somebody please let me know how to achieve this task in python?


Solution

  • Use custom function with mapping by helper dictionary in GroupBy.transform:

    d = {1:4, 2:3, 3:5}
    
    def f(x):
        x.iloc[:d[x.name]] = d[x.name]
        return x
    
    df['Salary'] = df.groupby('ID')['Salary'].transform(f)
    print (df)
        ID  Salary Children  Days
    0    1       4       No   123
    1    1       4      Yes   128
    2    1       4      Yes    66
    3    1       4      Yes   120
    4    1       5       No   141
    5    1       6       No   123
    6    1       7      Yes   128
    7    1       8      Yes    66
    8    1       9      Yes   120
    9    1      10       No   141
    10   2       3      Yes    52
    11   2       3      Yes    96
    12   2       3       No   120
    13   2       4      Yes   141
    14   2       5      Yes    52
    15   2       6      Yes    96
    16   3       5      Yes   120
    17   3       5      Yes   141
    18   3       5       No   123
    19   3       5      Yes    15
    20   3       5       No    85
    21   3       6      Yes    36
    22   3       7      Yes    58
    23   3       8       No    89
    

    Another idea is use GroupBy.cumcount for counter per ID, compared by mapped ID and if match set mapped Series by Series.mask:

    d = {1:4, 2:3, 3:5}
    
    s = df['ID'].map(d)
    df['Salary'] = df['Salary'].mask(df.groupby('ID').cumcount().lt(s), s)
    

    Or if counter column is in Salary is possible use:

    s = df['ID'].map(d)
    df['Salary'] = df['Salary'].mask(df['Salary'].le(s), s)
    

    print (df)
        ID  Salary Children  Days
    0    1       4       No   123
    1    1       4      Yes   128
    2    1       4      Yes    66
    3    1       4      Yes   120
    4    1       5       No   141
    5    1       6       No   123
    6    1       7      Yes   128
    7    1       8      Yes    66
    8    1       9      Yes   120
    9    1      10       No   141
    10   2       3      Yes    52
    11   2       3      Yes    96
    12   2       3       No   120
    13   2       4      Yes   141
    14   2       5      Yes    52
    15   2       6      Yes    96
    16   3       5      Yes   120
    17   3       5      Yes   141
    18   3       5       No   123
    19   3       5      Yes    15
    20   3       5       No    85
    21   3       6      Yes    36
    22   3       7      Yes    58
    23   3       8       No    89