I have a dataframe with a column of open response strings that identify a state in the US (hopefully, this will become a closed ended question soon). I need to assign each response with a state name and am currently working with the following code.
alabama_cat = ["alabama", "al"]
alaska_cat = ["alaska", "ak"]
newyork_cat = ["new york", "ny", "newyork"]
state_cat = [alabama_cat, alaska_cat, newyork_cat]
#Conditions for categories
conditions = [
(survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False)),
(survey['state'].str.lower().str.contains('|'.join(alaska_cat), na=False)),
(survey['state'].str.lower().str.contains('|'.join(newyork_cat), na=False)),
]
#Names of categories
choices = ["Alabama", "Alaska", "New York"]
# categorize
survey['state_category'] = np.select(conditions, choices)
I'm wondering if there is an easier way to create the conditions variable and am hoping to find an automated way to run each state_cat through (survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False)). I need to run this process for every state and possibly territories and instances where people input other countries.
Thanks so much for any insight.
Instead of checking for every cat, you can just try to extract any of the cat, then use map. Something like this:
# map the codes to actual names
state_codes = {code:choice for cat,choice in zip(state_cat, choices)
for code in cat}
patt = '|'.join(state_codes.keys())
survey['state_category'] = survey['state'].str.extract(f'({patt})', expand=False).map(state_codes)