I run into problems assigning a county to some city places. When querying via the acs
package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
, you can see that "New York", for instance, has a bunch of counties. So do Los Angeles, Portland, Oklahoma, Columbus etc. How can such data be assigned to a "county"?
Following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works for cases of only one county name output in the query.
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat,function(x) {
geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
In order to retain data, the left_join might better be matched as "look for county.name
that contains place.name
(without the appending xy city in the name), or choose the first item by default. It would be great to see how this could be done.
In general: I assume, there's no better way than this approach?
Thanks for your help!
What about something like the code below to create a "long" data frame for joining. We use the tidyverse
pipe operator to chain operations. strsplit
returns a list, which we unnest
to stack the list values (the county names that go with each combination of state.name
and place.name
) into a long data frame where each county.name
now gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name 1 36 New York <NA> NA <NA> 2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city 3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
state state.name place place.name county.name <chr> <chr> <int> <chr> <chr> 1 36 New York NA <NA> <NA> 2 36 New York 51000 New York city Bronx County 3 36 New York 51000 New York city Kings County 4 36 New York 51000 New York city New York County 5 36 New York 51000 New York city Queens County 6 36 New York 51000 New York city Richmond County 7 36 New York 51011 New York Mills village Oneida County
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
})
df
state state.name place place.name county.name 1 36 New York 51000 New York city Bronx County 2 36 New York 51000 New York city Kings County 3 36 New York 51000 New York city New York County 4 36 New York 51000 New York city Queens County 5 36 New York 51000 New York city Richmond County 6 36 New York 51011 New York Mills village Oneida County 7 25 Massachusetts 7000 Boston city Suffolk County 8 25 Massachusetts 7000 Boston city Suffolk County 9 6 California 20802 East Los Angeles CDP Los Angeles County 10 6 California 39612 Lake Los Angeles CDP Los Angeles County 11 6 California 44000 Los Angeles city Los Angeles County 12 48 Texas 19000 Dallas city Collin County 13 48 Texas 19000 Dallas city Dallas County 14 48 Texas 19000 Dallas city Denton County 15 48 Texas 19000 Dallas city Kaufman County 16 48 Texas 19000 Dallas city Rockwall County 17 48 Texas 40516 Lake Dallas city Denton County 18 6 California 20956 East Palo Alto city San Mateo County 19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"...
instead of gsub(" County"...
.
map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest %>%
slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
state state.name place place.name county.name <chr> <chr> <int> <chr> <chr> 1 36 New York 51000 New York city New York County 2 36 New York 51011 New York Mills village Oneida County 3 25 Massachusetts 7000 Boston city Suffolk County 4 6 California 20802 East Los Angeles CDP Los Angeles County 5 6 California 39612 Lake Los Angeles CDP Los Angeles County 6 6 California 44000 Los Angeles city Los Angeles County 7 48 Texas 19000 Dallas city Dallas County 8 48 Texas 40516 Lake Dallas city Denton County 9 6 California 20956 East Palo Alto city San Mateo County 10 6 California 55282 Palo Alto city Santa Clara County