I've got a large data frame in R consisting millions of unique addresses in a column (like 725 E 8TH ST).
In an effort to standardize addresses, I want to replace all the short street suffix abbreviations to their long form (changing E to East and ST to Street).
I am not looking to validate the addresses.
I found a package called campfin
that has a tibble of many predefined abbreviations and their long forms (see usps_street
).
I've tried to use the normal_address
and the expand_abbrev
functions to do this, but have been unsuccessful. Specifically, I tried normal_address(df$Address, abbs = usps_street)
and expand_abbrev(df$Address, abb = usps_street)
. However, these are not replacing the abbreviations and the Address column is remaining unchanged.
Does anyone know how to properly use these functions? Or are there any suggestions for replacing these things? I don't want to defined short-forms (like abb = c("ST" = "STREET") ) because there are too many of them. Thus, I thought usps_street
would be useful.
Here is a snippet of some of the address I'm looking at:
structure(list(Address = c("2771 NORTH THOMPSON RD NE", "10 BROOKWAY COURT",
"1000 WESTCLIFF AVE", "10009 NORTHEAST 145TH ST", "10021 PINECREST DRIVE",
"1003 WEST VICTORIA STREET"), Unit = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
City = c("ATLANTA", "MANSFIELD", "SAGINAW", "JONES", "PROVIDENCE VILLAGE",
"BRADY"), Country = c(NA, "US", "US", "US", "US", "US"),
Zip = c(30319, 76063, 76179, 73049, 76227, 76825), State = c("GA",
"TX", "TX", "OK", "TX", "TX")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
It looks like you're trying to do the inverse operation that the usps_street
data frame was created for. For example, look at the abbreviations for "ST"
:
library(dplyr)
library(campfin)
filter(usps_street, abb == "ST")
# # A tibble: 3 × 2
# full abb
# <chr> <chr>
# 1 STR ST
# 2 STREET ST
# 3 STRT ST
This is designed to normalise in the following ways:
STR -> ST
STREET -> ST
STRT -> ST
However, you want to go the other way, meaning you have to select which of the longer values to map to.
I am going to assume that you want the longest value, so STREET in this case. We need to create a data frame where the mapping is:
ST -> STREET
STR -> STREET
STRT -> STREET
We can do this fairly straightforwardly:
abbs <- usps_street |>
mutate(
num_char = nchar(full)
) |>
group_by(abb) |>
mutate(
abb = if_else(num_char == max(num_char), abb, full),
# take the first longest if it's a tie
full = full[ num_char == max(num_char)][1]
)
For STREET this looks like:
abbs |>
filter(full == "STREET")
full abb num_char
<chr> <chr> <int>
1 STREET STR 3
2 STREET ST 6
3 STREET STRT 4
You can then use the campfin::expand_abbrev()
function:
dat |>
mutate(
address_expanded = expand_abbrev(
Address,
abb = setNames(abbs$full, abbs$abb)
)
)
# A tibble: 6 × 7
Address Unit City Country Zip State address_expanded
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 2771 NORTH THOMPSON RD NE NA ATLANTA NA 30319 GA 2771 NORTH THOMPSON ROAD NORTHEAST
2 10 BROOKWAY COURT NA MANSFIELD US 76063 TX 10 BROOKWAY COURT
3 1000 WESTCLIFF AVE NA SAGINAW US 76179 TX 1000 WESTCLIFF AVENUE
4 10009 NORTHEAST 145TH ST NA JONES US 73049 OK 10009 NORTHEAST 145TH STREET
5 10021 PINECREST DRIVE NA PROVIDENCE VILLAGE US 76227 TX 10021 PINECREST DRIVE
6 1003 WEST VICTORIA STREET NA BRADY US 76825 TX 1003 WEST VICTORIA STREET