I have a column in a data frame like this (it includes many more addresses):
Address |
---|
287 Andover Pl Robbinsville NJ 08691 |
1 Oxford Ct Princeton Jct NJ 08550 |
244 N Post Rd Princeton Jct NJ 08550 |
3 Gates Ct West Windsor NJ 08550 |
How can I separate these addresses into 2 columns, so that it has the number and street address in one column, and the city, state, zip in another like this:
Address 1 | Address 2 |
---|---|
287 Andover Pl | Robbinsville NJ 08691 |
1 Oxford Ct | Princeton Jct NJ 08550 |
244 N Post Rd | Princeton Jct NJ 08550 |
3 Gates Ct | West Windsor NJ 08550 |
I have tried doing this by splitting on spaces from the right, but some town names contain two words, so this approach does not work:
# put zip code into separate column
clean_df[['prop_addressLine1','Owner Zip']] = clean_df['Owner Address'].str.rsplit(' ', n=1, expand=True)
# separate street, city, and state into their own columns:
clean_df[['Owner Street','Owner City','Owner State']] = clean_df['prop_addressLine1'].str.rsplit(' ', n=2, expand=True)
I was thinking of separating everything into separate columns by space, but there are irregular spaces, and I have not figured out how to make it work.
One option is to use the street-type abbreviations (Ct/Pl/Rd) as delimiters and extract the two parts:
df[['Address 1', 'Address 2']] = (
    df['Address'].str.extract(r'^(.*?(?:Pl|Ct|Rd)) (.*)$')
)
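The same idea can be sketched in plain Python with the standard `re` module, which also makes it easy to collapse irregular spaces before matching. This is only a minimal sketch: `STREET_TYPES`, `ADDRESS_RE`, and `split_address` are hypothetical names, and the list of abbreviations is an assumption that would need to be extended for real data.

```python
import re

# Hypothetical, deliberately short list of street-type abbreviations;
# extend it to cover the suffixes that actually occur in your data.
STREET_TYPES = ["Pl", "Ct", "Rd", "St", "Ave", "Dr", "Ln", "Blvd"]

# Lazily match up to the first street type that stands alone as a word,
# then capture everything after it as the second part.
ADDRESS_RE = re.compile(
    r"^(.*?\b(?:" + "|".join(STREET_TYPES) + r")\b)\s+(.*)$"
)

def split_address(addr):
    """Return (street, city_state_zip), or (None, None) if no street type is found."""
    cleaned = " ".join(addr.split())  # collapse runs of whitespace into single spaces
    match = ADDRESS_RE.match(cleaned)
    return match.groups() if match else (None, None)

addresses = [
    "287 Andover Pl Robbinsville NJ 08691",
    "244  N Post Rd   Princeton Jct NJ 08550",  # irregular spacing on purpose
]
results = [split_address(a) for a in addresses]
```

On a data frame, `df['Address'].map(split_address)` would yield one tuple per row. Note that the first matching abbreviation wins, so a street such as "St James Pl" would be split too early with this list.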
If you want to handle more complex addresses, you'll need a parser. For example, with usaddress:
import usaddress

def address_split(addr):
    address1 = []
    address2 = []
    flag = False  # becomes True once the first city token is seen
    for val, k in usaddress.parse(addr):
        if k == 'PlaceName':
            flag = True
        if flag:
            address2.append(val)
        else:
            address1.append(val)
    return ' '.join(address1), ' '.join(address2)

# apply to every row and unpack the pairs into two new columns
df['Address 1'], df['Address 2'] = list(zip(*map(address_split, df['Address'])))
Output:
                                Address       Address 1               Address 2
0  287 Andover Pl Robbinsville NJ 08691  287 Andover Pl   Robbinsville NJ 08691
1    1 Oxford Ct Princeton Jct NJ 08550     1 Oxford Ct  Princeton Jct NJ 08550
2  244 N Post Rd Princeton Jct NJ 08550   244 N Post Rd  Princeton Jct NJ 08550
3      3 Gates Ct West Windsor NJ 08550      3 Gates Ct   West Windsor NJ 08550
Intermediate output of usaddress.parse:
# usaddress.parse('244 N Post Rd Princeton Jct NJ 08550')
[('244', 'AddressNumber'),
('N', 'StreetNamePreDirectional'),
('Post', 'StreetName'),
('Rd', 'StreetNamePostType'),
('Princeton', 'PlaceName'),
('Jct', 'PlaceName'),
('NJ', 'StateName'),
('08550', 'ZipCode')]