pythonpandasdata-cleaningcity

Clean up city names in python


in a dataframe, there is a 'City' column containing different city names as well as various formats/misspellings. Here are the unique values in the column:

array(['somerville', 'hyde park', 'lexington', 'brookline', 'wellesley',
       'dover ', 'newton', 'westford', 'boston', 'needham', 'arlington',
       'wayland', 'waltham', 'cambridge', 'chestnuthill', 'salisbury ',
       'waban', 'weston', 'neeham', 'auburndale', 'belmont', 'allston',
       'auberdale', 'bedford', 'dover', 'lawrence', 'wilmington',
       'woburn', 'braintree', 'acton', 'winchester', 'middleton',
       'west newton', 'watertown', 'newton center', 'northfield',
       'roslindale', 'westwood', 'groton', 'natick', 'concord',
       'chestnut hill', 'westborough', 'sudbury', 'sherborn', 'quincy',
       'burlington', 'andover', 'littleton', 'stoughton'], dtype=object)

I want to clean up only four cities names using mapping, and leave other city names unchanged.

I used this code below:

cities_names = (('Newton', ['west newton', 'newton center', 'chestnut hill', 'chestnuthill', 'waban', 'auberdale', 'auburndale']),
              ('Dover', ['dover ']), 
              ('Needham', ['neeham']), 
              ('Wellesley', ['wellesly']))

cities_map = {y:x[0] for x in cities_tup for y in x[1]}

df_MA.City = df_MA.City.map(cities_map)
df_MA.City.unique()

But the output is : array([nan, 'Dover', 'Newton', 'Needham'], dtype=object)

So basically, it changed all other city names into nan which is not I want. Are there any other methods or packages to clean up city names?


Solution

  • Use replace:

    df_MA['City'] = df_MA['City'].replace(cities_map)
    

    Or combine map and fillna:

    df_MA['City'] = df_MA['City'].map(cities_map).fillna(df_MA['City'])
    

    NB. Don't assign to df_MA.City, always use the square brackets notation: df_MA['City'].