regexrqdapregex

With R, when string extraction creates a list element in a data frame, how can you add a row for each item in the list?


I have hundreds of addresses in a data frame variable and need to extract zip codes from them. Some of the addresses contain more than one city, each with a zip code. Here is a mock example of a data frame and R code that extracts the zip codes.

require(qdapRegex)
require(stringr)

df <- data.frame(address = c("Walnut; 94596, Ontario, 91761, Beach, CA 90071", "Irvine Cal 92164"), var2 = "text")
df$zip.Rinker <- sapply(df$address, FUN = rm_zip, extract=TRUE) 

The rm_zip function from Tyler Rinker's qdapRegex package extracts all the zip codes and puts them in list if there is more than one.

> df
                                         address var2          zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596, 91761, 90071
2                               Irvine Cal 92164 text               92164

How can R create a new row for each zip code that is in row 1 under zip.Rinker? Something like the following would be ideal. Note, there will be dozens of addresses that have multiple zip codes, so I am hoping for a solution that doesn't require manual steps.

                                         address var2          zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               94596
2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               91761
3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               90071
4                               Irvine Cal 92164 text               92164

Thank you for time.

PS Using stringr, this code extracts zip codes and presents the same challenge.

df$zip.stringr <- str_extract_all(string = df$address, pattern = "\\d{5}") 

Solution

  • You could do:

    data.frame(rep(df$address, sapply(df$zip.Rinker, length)), unlist(df$zip.Rinker)
    
    ##   rep.df.address..sapply.df.zip.Rinker..length.. unlist.df.zip.Rinker.
    ## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 94596
    ## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 91761
    ## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 90071
    ## 4                               Irvine Cal 92164                 92164
    

    But note that rm_zip is already vectorized and pretty speedy as it wraps the stringi package. So no need for sapply. Here's an approach that makes the code much more condensed using qdapTools's list2df that takes a named list of vectors and turns them into a data.frame.

    library(qdapTools)
    list2df(setNames(rm_zip(df$address, extract=TRUE), df$address), "zip", "address")[, 2:1]
    
    ##                                          address   zip
    ## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
    ## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
    ## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
    ## 4                               Irvine Cal 92164 92164
    

    And I like the magrittr framework for nested functions so here's that:

    library(qdapTools)
    library(magrittr)
    
    df$address %>%
        rm_zip(extract=TRUE) %>%
        setNames(df$address) %>%
        list2df("zip", "address") %>%
        `[`(, 2:1)