rdplyrtidyversestringdistqdap

R: Group Similar Addresses Together


I have a 400,000 row file with manually entered addresses which need to be geocoded. There's a lot of different variations of the same addresses in the file, so it seems wasteful to be using API calls for the same address multiple times.

To cut down on this, I'd like to reduce these five rows:

    Address
    1 Main Street, Country A, World
    1 Main St, Country A, World
    1 Maine St, Country A, World
    2 Side Street, Country A, World
    2 Side St. Country A, World

down to two:

    Address
    1 Main Street, Country A, World
    2 Side Street, Country A, World

Using the stringdist package you can group the 'word' part of the strings together, but the string matching algorithms don't differentiate between the numbers. This means that it categorises two different houses numbers on the same street as the same address.

To work around this, I came up with two ways of doing it: firsly, trying to manually separate the numbers and the addresses into separate columns using regular expressions and rejoining them afterwards. The problem with this, is that with so many manually entered addresses, there seems to be hundreds of different edge cases and it gets unwieldy.

Using this answer on grouping and this on converting words to numbers, I have a second approach which deals with the edge cases but is incredibly expensive computationally. Is there a better third way of doing this?

library(gsubfn)
library(english)
library(qdap)
library(stringdist)
library(tidyverse)


similarGroups <- function(x, thresh = 0.8, method = "lv"){
  grp <- integer(length(x))
  Address <- x
  x <- tolower(x)
  for(i in seq_along(Address)){
    if(!is.na(Address[i])){
      sim <- stringdist::stringsim(x[i], x, method = method)
      k <- which(sim > thresh & !is.na(Address))
      grp[k] <- i
      is.na(Address) <- k
    }
  }
  grp
}

df <- data.frame(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "1 Maine St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. Country A, World"))

df1 <- df %>%
  # Converts Numbers into Letters
  mutate(Address = replace_number(Address),
         # Groups Similar Addresses Together
         Address = Address[similarGroups(Address, thresh = 0.8, method = "lv")],
         # Converts Letters back into Numbers
         Address = gsubfn("\\w+", setNames(as.list(1:1000), as.english(1:1000)), Address)
  ) %>%
  # Removes the Duplicates
  unique()

Solution

  • Might want to look into OpenRefine, or the refinr package for R, which is much less visual but still good. It has two functions, key_collision_merge and n_gram_merge which has several parameters. If you have a dictionary of good addresses, you can pass that to key_collision_merge.

    Probably good to make note of the abbreviations you see often (St., Blvd., Rd., etc.) and replace all of those. Surely there is a good table somewhere of these abbreviations, like https://www.pb.com/docs/US/pdf/SIS/Mail-Services/USPS-Suffix-Abbreviations.pdf.

    Then:

    library(refinr)    
    df <- tibble(Address = c("1 Main Street, Country A, World", 
                                 "1 Main St, Country A, World", 
                                 "1 Maine St, Country A, World", 
                                 "2 Side Street, Country A, World", 
                                 "2 Side St. Country A, World",
                                  "3 Side Rd. Country A, World",
                                  "3 Side Road Country B World"))
    df2 <- df %>%
      mutate(address_fix = str_replace_all(Address, "St\\.|St\\,|St\\s", "Street"),
             address_fix = str_replace_all(address_fix, "Rd\\.|Rd\\,|Rd\\s", "Road")) %>%
      mutate(address_merge = n_gram_merge(address_fix, numgram = 1))
    
    df2$address_merge
    [1] "1 Main Street Country A, World"
    [2] "1 Main Street Country A, World"
    [3] "1 Main Street Country A, World"
    [4] "2 Side Street Country A, World"
    [5] "2 Side Street Country A, World"
    [6] "3 Side Road Country A, World"  
    [7] "3 Side Road Country B World"