pythonpandasdata-cleaningstreet-address

How do I split addresses with no commas, and irregular layouts in a Pandas data frame?


I have a column in a data frame like this (it includes many more address):

Address
287 Andover Pl Robbinsville NJ 08691
1 Oxford Ct Princeton Jct NJ 08550
244 N Post Rd Princeton Jct NJ 08550
3 Gates Ct West Windsor NJ 08550

How can I separate these addresses into 2 columns, so that it has the number and street address in one column, and the city, state, zip in another like this:

Address 1 Address 2
287 Andover Pl Robbinsville NJ 08691
1 Oxford Ct Princeton Jct NJ 08550
244 N Post Rd Princeton Jct NJ 08550
3 Gates Ct West Windsor NJ 08550

I have tried doing this by separating everything by space from the right, but some towns include two words in their names, and thus this way does not work.

# put zip code into separate column
        clean_df[['prop_addressLine1','Owner Zip']] = clean_df['Owner Address'].str.rsplit(' ', n=1, expand=True)

        # separate street, city, and state into their own columns:
        clean_df[['Owner Street','Owner City','Owner State']] = clean_df['prop_addressLine1'].str.rsplit(' ', n=2, expand=True)

I was thinking of separating everything into separate columns by space, but there are irregular spaces, and I have not figured out how to make it work.


Solution

  • One option is to use the street type abbreviations (Ct/Pl/Rd) as delimiters and to extract the two parts:

    df[['Address 1', 'Address 2']] = (df['Address'].str
                                      .extract('^(.*?(?:Pl|Ct|Rd)) (.*)$')
                                     )
    

    If you want to handle more complex addresses, you'll need a parser. For example with usaddress:

    import usaddress
    
    def address_split(addr):
        address1 = []
        address2 = []
        flag = False
        for val, k in usaddress.parse(addr):
            if k == 'PlaceName':
                flag = True
            if flag:
                address2.append(val)
            else:
                address1.append(val)
        return ' '.join(address1), ' '.join(address2)
    
    
    df['Address 1'], df['Address 2'] = list(zip(*map(address_split, df['Address'])))
    

    Output:

                                    Address       Address 1               Address 2
    0  287 Andover Pl Robbinsville NJ 08691  287 Andover Pl   Robbinsville NJ 08691
    1    1 Oxford Ct Princeton Jct NJ 08550     1 Oxford Ct  Princeton Jct NJ 08550
    2  244 N Post Rd Princeton Jct NJ 08550   244 N Post Rd  Princeton Jct NJ 08550
    3      3 Gates Ct West Windsor NJ 08550      3 Gates Ct   West Windsor NJ 08550
    

    Intermediate with usaddress.parse:

    # usaddress.parse('244 N Post Rd Princeton Jct NJ 08550')
    
    [('244', 'AddressNumber'),
     ('N', 'StreetNamePreDirectional'),
     ('Post', 'StreetName'),
     ('Rd', 'StreetNamePostType'),
     ('Princeton', 'PlaceName'),
     ('Jct', 'PlaceName'),
     ('NJ', 'StateName'),
     ('08550', 'ZipCode')]