pythonpandascumsumaccumulate

python pandas create new string column cumulatively based on other column condition


Suppose I have a dataset (df)

Group | Employee_Title | Employee_Name  
A     | Manager        | John     
A     | Analyst        | Adam     
A     | Analyst        | Smith    
B     | Manager        | Bill    
B     | Analyst        | Ed    
B     | Analyst        | Jay 

I want to create a new column "Group_Manager" so that the new dataset would be:

Group | Employee_Title | Employee_Name | Group_Manager 
A     | Manager        | John          | John
A     | Analyst        | Adam          | John           
A     | Analyst        | Smith         | John    
B     | Manager        | Bill          | Bill    
B     | Analyst        | Ed            | Bill       
B     | Analyst        | Jay           | Bill  

I am looking for python code that can do this in some "cumulative" way, like (not working right now) :

df['Group_Manager']=df.groupby('Group').apply(lambda Employee_Title,Employee_Name: Employee_Name if Employee_Title=="Manager" else keep previous Group_Manager)

Solution

  • By retrieving the manager's name for each group and then reindex it according to the main dataframe 'Group' column, you can achieve the results that you desired

    import pandas as pd
    
    # Sample data
    data = {
        'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Employee_Title': ['Manager', 'Analyst', 'Analyst', 'Manager', 'Analyst', 'Analyst'],
        'Employee_Name': ['John', 'Adam', 'Smith', 'Bill', 'Ed', 'Jay']
    }
    df = pd.DataFrame(data)
    
    # Create the Group_Manager column
    df['Group_Manager'] = df.groupby('Group').apply(lambda g: g['Employee_Name'][g['Employee_Title'] == 'Manager'].iloc[0]).reindex(df['Group']).reset_index(drop=True)
    
    print(df)
    

    which results in

      Group Employee_Title Employee_Name Group_Manager
    0     A        Manager          John          John
    1     A        Analyst          Adam          John
    2     A        Analyst         Smith          John
    3     B        Manager          Bill          Bill
    4     B        Analyst            Ed          Bill
    5     B        Analyst           Jay          Bill
    

    Another approach using itertools.accumulate

    import pandas as pd
    import itertools
    
    # Sample data
    data = {
        'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Employee_Title': ['Manager', 'Analyst', 'Analyst', 'Manager', 'Analyst', 'Analyst'],
        'Employee_Name': ['John', 'Adam', 'Smith', 'Bill', 'Ed', 'Jay']
    }
    df = pd.DataFrame(data)
    
    # Create a series with manager names where title is 'Manager', and NaN elsewhere
    managers = df['Employee_Name'].where(df['Employee_Title'] == 'Manager', None)
    
    # Use accumulate to carry forward the manager names
    df['Group_Manager'] = list(itertools.accumulate(managers, lambda x, y: x if y is None else y))
    
    print(df)
    

    which results in

      Group Employee_Title Employee_Name Group_Manager
    0     A        Manager          John          John
    1     A        Analyst          Adam          John
    2     A        Analyst         Smith          John
    3     B        Manager          Bill          Bill
    4     B        Analyst            Ed          Bill
    5     B        Analyst           Jay          Bill