I'm cleaning a medium-sized data set (provided to me) in R that includes address information. I need to remove everything that follows the ZIP code, but I'm unsure how to do so because that trailing text is not static: it varies from row to row and is sometimes absent. Below is a sample:
addresses <- c("515 DUMMY 1 75253 69AP",
"1000 DUMMY 2 75211",
"3948 DUMMY 3 75217 69Q",
"4545 DUMMY 4 75217 MAP 68C")
In essence, I need to transform these addresses into the following format:
"515 DUMMY 1 75253",
"1000 DUMMY 2 75211",
"3948 DUMMY 3 75217",
"4545 DUMMY 4 75217"
Thanks in advance for any help you may be able to provide.
A classic regex approach would be something like the code below. I've added one more address that starts with a 5-digit house number, to make sure we don't over-remove.
addresses <- c("515 DUMMY 1 75253 69AP",
"1000 DUMMY 2 75211",
"3948 DUMMY 3 75217 69Q",
"4545 DUMMY 4 75217 MAP 68C",
"45454 DUMMY 4 75217 MAP 68C")
sub("^(.+)\\b(\\d{5})\\b.*", "\\1\\2", addresses)
# [1] "515 DUMMY 1 75253" "1000 DUMMY 2 75211" "3948 DUMMY 3 75217" "4545 DUMMY 4 75217" "45454 DUMMY 4 75217"
Regex:
"^(.+)\\b(\\d{5})\\b.*"
 ^^^^^                  something at the beginning of the string,
                        so that we don't false-trigger on a 5-digit
                        house number (a little fragile)
      ^^^        ^^^    word boundaries
         ^^^^^^^^       exactly five digits ([0-9])
                    ^^  anything else (discarded)
The (...) are saved groups, and \\1\\2 in the replacement restores those two groups.
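The "a little fragile" caveat can be made concrete. Because `.+` is greedy, the pattern keeps everything up to the last standalone five-digit token, so trailing text that itself contains five digits would survive. The sketch below is not part of the approach above: `strip_after_zip` is a hypothetical helper name, the `tricky` address is made up for illustration, and `perl = TRUE` is an assumption used only to pin down the regex engine. It swaps in a lazy `.*?` so that the first five-digit token preceded by whitespace is treated as the ZIP code.

```r
# Hypothetical helper: keep everything up to the FIRST 5-digit token that
# follows whitespace, so a leading 5-digit house number is skipped.
strip_after_zip <- function(x) {
  sub("^(.*?\\s\\d{5})\\b.*$", "\\1", x, perl = TRUE)
}

# Made-up address whose trailing map code also has five digits.
tricky <- "515 DUMMY 1 75253 MAP 12345"

# Greedy pattern from above: the last 5-digit token wins, nothing is removed.
greedy <- sub("^(.+)\\b(\\d{5})\\b.*", "\\1\\2", tricky, perl = TRUE)

# Lazy variant: stops at the first candidate ZIP code.
lazy <- strip_after_zip(tricky)

print(greedy)  # "515 DUMMY 1 75253 MAP 12345"
print(lazy)    # "515 DUMMY 1 75253"
print(strip_after_zip("45454 DUMMY 4 75217 MAP 68C"))  # "45454 DUMMY 4 75217"
```

The lazy variant has the opposite weakness: a street name that contains a standalone five-digit number would be mistaken for the ZIP code. Which trade-off is safer depends on what the real data looks like.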
Quick edit: I don't like having to double-backslash everything, so in R 4.0.0 or later, which supports "raw strings", we can do
sub(r"{^(.+)\b(\d{5})\b.*}", r"{\1\2}", addresses)
I think it makes it a little easier to read, though we still need to mentally discard the leading/trailing braces (we can also use r"(..)", r"[..]", r"|..|").
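As a quick check that the raw-string form is only a change of notation, the sketch below compares the two spellings of the pattern. It assumes R 4.0.0 or later, and the variable names are illustrative.

```r
# Requires R >= 4.0.0 (raw string literals).
pattern_escaped <- "^(.+)\\b(\\d{5})\\b.*"
pattern_raw     <- r"{^(.+)\b(\d{5})\b.*}"

# Both literals denote the same character string.
same_pattern <- identical(pattern_escaped, pattern_raw)
print(same_pattern)  # TRUE

# Dashes between the quote and the delimiter are also allowed, which helps
# when the text itself contains the closing delimiter followed by a quote.
same_replacement <- identical(r"---(\1\2)---", "\\1\\2")
print(same_replacement)  # TRUE
```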