pythonpandaslist

How to create a new column based on an existing column matching one of four lists?


I have a dataframe with a column called "education" which has over 100 unique values. I want to create a new column and categorize "education" into four levels: high school, undergrad, masters, and doctorate.

The following example is oversimplified but this is what I don't want to do:

df['level'] = df['education'].map({'grade 9' : 'high school', 
                                   'grade 10' : 'high school', 
                                   'grade 11' : 'high school', 
                                   'BS' : 'undergrad', 
                                   'B.Com' : 'undergrad', 
                                   'BA' : 'undergrad', 
                                   'MBA' : 'masters', 
                                   'PhD' : 'doctorate'
                                  })

The above worked well for another column but I find it too tedious in this instance. Is there a way that I could use an if-else statement with the following lists (or is there a better way)? Although suggested in a similar question, it would not be sufficient to use .startswith() because a bachelor's won't always start with "B", a master's won't always start with "M", etc.

high_school = ['grade 9', 'grade 10', 'grade 11', 'grade 12']
undergrad = ['BS', 'B.Com', 'B.Ed', 'BA', 'LLB']
masters = ['MA', 'MBA', 'M.Ed', 'MPA', 'LLM', 'MTech', 'M.Tech']
doctorate = ['PhD', 'MD', 'DO']

This is an example of my desired outcome:

education level
BS undergrad
LLM masters
DO doctorate
5.47 NaN
grade 9 high school
MBA masters
John NaN

Thank you for your help in advance.


Solution

  • If you have these 4 arrays, you start with putting them in a dictionary like so:

    EDUCATION_LEVELS = {
        'high_school': ['grade 9', 'grade 10', 'grade 11', 'grade 12'],
        'undergrad': ['BS', 'B.Com', 'B.Ed', 'BA', 'LLB'],
        'masters': ['MA', 'MBA', 'M.Ed', 'MPA', 'LLM', 'MTech', 'M.Tech'],
        'doctorate': ['PhD', 'MD', 'DO']  
        }
    

    After that the easiest is to define a custom mapper and use apply on the column:

    def map_education(x):
        for k, v in EDUCATION_LEVELS.items():
            if x in v:
                return k
        return np.nan
    
    df['education'].apply(map_education)
    

    That should do the trick.