rdplyrmatchgeocodingacs

R: Mission impossible? How to assign "New York" to a county


I run into problems assigning a county to some city places. When querying via the acs package

> geo.lookup(state = "NY", place = "New York")
  state state.name                                                                 county.name place             place.name
1    36   New York                                                                        <NA>    NA                   <NA>
2    36   New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000          New York city
3    36   New York                                                               Oneida County 51011 New York Mills village

, you can see that "New York", for instance, has a bunch of counties. So do Los Angeles, Portland, Oklahoma, Columbus etc. How can such data be assigned to a "county"?

Following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works for cases of only one county name output in the query.

Script

dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat

library(tigris)
library(acs)
data(fips_codes) # FIPS codes with state, code, county information

GeoLookup <- lapply(dat,function(x) {
  geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})

df <- bind_rows(GeoLookup)

#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")

# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).

df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df

Returns:

  state    state.name                                                                  county.name place           place.name state.abb statefips countyfips
1    36      New York  Bronx County, Kings County, New York County, Queens County, Richmond County 51000        New York city      <NA>      <NA>       <NA>
2    25 Massachusetts                                                               Suffolk County  7000          Boston city        MA        25        025
3     6    California                                                           Los Angeles County 20802 East Los Angeles CDP        CA        06        037
4    48         Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000          Dallas city      <NA>      <NA>       <NA>
5     6    California                                                             San Mateo County 20956  East Palo Alto city        CA        06        081

In order to retain data, the left_join might better be matched as "look for county.name that contains place.name (without the appending xy city in the name), or choose the first item by default. It would be great to see how this could be done.

In general: I assume, there's no better way than this approach?

Thanks for your help!


Solution

  • What about something like the code below to create a "long" data frame for joining. We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame where each county.name now gets its own row.

    library(tigris)
    library(acs)  
    library(tidyverse)
    
    dat = geo.lookup(state = "NY", place = "New York")  
    
      state state.name                                                                 county.name place             place.name
    1    36   New York                                                                        <NA>    NA                   <NA>
    2    36   New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000          New York city
    3    36   New York                                                               Oneida County 51011 New York Mills village
    
    dat = dat %>% 
      group_by(state.name, place.name) %>% 
      mutate(county.name = strsplit(county.name, ", ")) %>% 
      unnest
    
      state state.name place             place.name     county.name
      <chr>      <chr> <int>                  <chr>           <chr>
    1    36   New York    NA                   <NA>            <NA>
    2    36   New York 51000          New York city    Bronx County
    3    36   New York 51000          New York city    Kings County
    4    36   New York 51000          New York city New York County
    5    36   New York 51000          New York city   Queens County
    6    36   New York 51000          New York city Richmond County
    7    36   New York 51011 New York Mills village   Oneida County
    

    UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:

    dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
    
    df <- map_df(strsplit(dat, ", "), function(x) {
      geo.lookup(state = x[2], place = x[1])[-1, ] %>% 
        group_by(state.name, place.name) %>%
        mutate(county.name = strsplit(county.name, ", ")) %>%
        unnest
    })
    
    df
    
       state    state.name place             place.name        county.name
     1    36      New York 51000          New York city       Bronx County
     2    36      New York 51000          New York city       Kings County
     3    36      New York 51000          New York city    New York County
     4    36      New York 51000          New York city      Queens County
     5    36      New York 51000          New York city    Richmond County
     6    36      New York 51011 New York Mills village      Oneida County
     7    25 Massachusetts  7000            Boston city     Suffolk County
     8    25 Massachusetts  7000            Boston city     Suffolk County
     9     6    California 20802   East Los Angeles CDP Los Angeles County
    10     6    California 39612   Lake Los Angeles CDP Los Angeles County
    11     6    California 44000       Los Angeles city Los Angeles County
    12    48         Texas 19000            Dallas city      Collin County
    13    48         Texas 19000            Dallas city      Dallas County
    14    48         Texas 19000            Dallas city      Denton County
    15    48         Texas 19000            Dallas city     Kaufman County
    16    48         Texas 19000            Dallas city    Rockwall County
    17    48         Texas 40516       Lake Dallas city      Denton County
    18     6    California 20956    East Palo Alto city   San Mateo County
    19     6    California 55282         Palo Alto city Santa Clara County
    

    UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....

    map_df(strsplit(dat, ", "), function(x) {
      geo.lookup(state = x[2], place = x[1])[-1, ] %>% 
        group_by(state.name, place.name) %>%
        mutate(county.name = strsplit(county.name, ", ")) %>%
        unnest %>% 
        slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
    })
    
       state    state.name place             place.name        county.name
       <chr>         <chr> <int>                  <chr>              <chr>
     1    36      New York 51000          New York city    New York County
     2    36      New York 51011 New York Mills village      Oneida County
     3    25 Massachusetts  7000            Boston city     Suffolk County
     4     6    California 20802   East Los Angeles CDP Los Angeles County
     5     6    California 39612   Lake Los Angeles CDP Los Angeles County
     6     6    California 44000       Los Angeles city Los Angeles County
     7    48         Texas 19000            Dallas city      Dallas County
     8    48         Texas 40516       Lake Dallas city      Denton County
     9     6    California 20956    East Palo Alto city   San Mateo County
    10     6    California 55282         Palo Alto city Santa Clara County