I have a dataframe with a column called "education" which has over 100 unique values. I want to create a new column and categorize "education" into four levels: high school, undergrad, masters, and doctorate.
The following example is oversimplified but this is what I don't want to do:
df['level'] = df['education'].map({'grade 9' : 'high school',
'grade 10' : 'high school',
'grade 11' : 'high school',
'BS' : 'undergrad',
'B.Com' : 'undergrad',
'BA' : 'undergrad',
'MBA' : 'masters',
'PhD' : 'doctorate'
})
The above worked well for another column but I find it too tedious in this instance. Is there a way that I could use an if-else statement with the following lists (or is there a better way)? Although suggested in a similar question, it would not be sufficient to use .startswith() because a bachelor's won't always start with "B", a master's won't always start with "M", etc.
high_school = ['grade 9', 'grade 10', 'grade 11', 'grade 12']
undergrad = ['BS', 'B.Com', 'B.Ed', 'BA', 'LLB']
masters = ['MA', 'MBA', 'M.Ed', 'MPA', 'LLM', 'MTech', 'M.Tech']
doctorate = ['PhD', 'MD', 'DO']
This is an example of my desired outcome:
education | level |
---|---|
BS | undergrad |
LLM | masters |
DO | doctorate |
5.47 | NaN |
grade 9 | high school |
MBA | masters |
John | NaN |
Thank you for your help in advance.
If you have these 4 arrays, you start with putting them in a dictionary like so:
EDUCATION_LEVELS = {
'high_school': ['grade 9', 'grade 10', 'grade 11', 'grade 12'],
'undergrad': ['BS', 'B.Com', 'B.Ed', 'BA', 'LLB'],
'masters': ['MA', 'MBA', 'M.Ed', 'MPA', 'LLM', 'MTech', 'M.Tech'],
'doctorate': ['PhD', 'MD', 'DO']
}
After that the easiest is to define a custom mapper and use apply
on the column:
def map_education(x):
for k, v in EDUCATION_LEVELS.items():
if x in v:
return k
return np.nan
df['education'].apply(map_education)
That should do the trick.