
Standardizing address formatting in R


I'm cleaning a medium-sized data set in R (provided to me) that includes address information. I need to remove everything that follows the ZIP code, but I'm unsure how to do so, because the text after the ZIP code is not static: it varies in content and length, and is sometimes absent. Below is a sample:

addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2  75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C")

In essence, I need to transform these addresses into the following format:

"515 DUMMY 1 75253",
"1000 DUMMY 2  75211",
"3948 DUMMY 3 75217",
"4545 DUMMY 4 75217"

Thanks in advance for any help you may be able to provide.


Solution

  • A classic regex approach would be something like the code below. I've added one more address that starts with a 5-digit number, to make sure we don't over-remove.

    addresses <- c("515 DUMMY 1 75253 69AP",
                   "1000 DUMMY 2  75211",
                   "3948 DUMMY 3 75217 69Q",
                   "4545 DUMMY 4 75217 MAP 68C",
                   "45454 DUMMY 4 75217 MAP 68C")
    sub("^(.+)\\b(\\d{5})\\b.*", "\\1\\2", addresses)
    # [1] "515 DUMMY 1 75253"   "1000 DUMMY 2  75211" "3948 DUMMY 3 75217"  "4545 DUMMY 4 75217"  "45454 DUMMY 4 75217"
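
    Before running this over the full data set, it may be worth checking a few edge cases. The sketch below reuses the same pattern; the extra addresses and the names zip_pattern, edge_cases, cleaned and unchanged are made up for illustration and are not from the original data.

```r
# The pattern from the answer above, unchanged.
zip_pattern <- "^(.+)\\b(\\d{5})\\b.*"

# Hypothetical inputs chosen to probe the pattern's limits.
edge_cases <- c(
  "45454 DUMMY 4",            # 5-digit house number, no ZIP code at all
  "515 DUMMY 1 75253 12345",  # trailing junk that is itself a 5-digit token
  "1000 DUMMY 2  75211",      # already clean
  "3948 DUMMY 3 75217 69Q"    # ordinary case with trailing junk
)

cleaned <- sub(zip_pattern, "\\1\\2", edge_cases)
print(cleaned)

# Flag rows the pattern left untouched, so they can be reviewed by hand.
unchanged <- cleaned == edge_cases
print(edge_cases[unchanged])
```

    The first address has no match, so sub() returns it as is. The second is also returned as is, because the greedy (.+) pushes the match to the last 5-digit token in the string; trailing text that happens to be exactly five digits would therefore survive.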
    

    Regex:

    "^(.+)\\b(\\d{5})\\b.*"
     ^^^^^                    at least one character before the ZIP code,
                              so that we don't false-trigger on a 5-digit
                              house number (a little fragile)
          ^^^        ^^^      word boundaries
             ^^^^^^^^         exactly five digits ([0-9])
                        ^^    anything else (discarded)
    

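    The (...) are capture groups, and \\1\\2 in the replacement are back-references that restore those two groups; everything matched after the second group is dropped.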
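
    To see what each group actually holds, one option is to pull the pieces out with base R's regexec() and regmatches(). This is only an inspection sketch; the names zip_pattern, x, parts and rebuilt are illustrative.

```r
zip_pattern <- "^(.+)\\b(\\d{5})\\b.*"
x <- "4545 DUMMY 4 75217 MAP 68C"

# regexec() locates the full match and each capture group;
# regmatches() extracts the corresponding text.
parts <- regmatches(x, regexec(zip_pattern, x))[[1]]
print(parts)
# parts[1] is the full match, parts[2] is group 1, parts[3] is group 2.

# Pasting the two groups together reproduces what "\\1\\2" does in sub().
rebuilt <- paste0(parts[2], parts[3])
print(identical(rebuilt, sub(zip_pattern, "\\1\\2", x)))
```

    Group 1 keeps the trailing space before the ZIP code, which is why the replacement needs no separator between \\1 and \\2.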

    Quick edit: I don't like having to double-backslash everything, so in R 4.0.0 or later, which supports "raw strings", we can do

    sub(r"{^(.+)\b(\d{5})\b.*}", r"{\1\2}", addresses)
    

    I think it makes it a little easier to read, though we still need to mentally discard the leading/trailing braces (we can also use r"(..)", r"[..]", r"|..|").
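
    As a quick sanity check (assuming R 4.0.0 or later), the escaped and raw forms can be compared directly; the variable names here are illustrative.

```r
# Requires R >= 4.0.0, where raw string literals were introduced.
escaped <- "^(.+)\\b(\\d{5})\\b.*"
raw     <- r"{^(.+)\b(\d{5})\b.*}"

# Both literals should produce the same character string.
same_pattern <- identical(escaped, raw)
print(same_pattern)

# The same holds for the replacement string.
same_replacement <- identical("\\1\\2", r"{\1\2}")
print(same_replacement)

# cat() shows the string as the regex engine receives it: single backslashes.
cat(escaped, "\n")
```

    Since the two literals are the same string, the choice between them is purely about readability.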