The st_read()
command from the sf
package in R
can take SQL queries (see example here) when reading in shapefiles, but I'd like to be able to include a subquery.
For example, there's a shapefile here with US states:
library(sf)
p <- './cb_2018_us_state_20m.shp' #Path
allStates <- st_read(p) #Reads full dataset
#Query for two states
q1 <- 'select * from \"cb_2018_us_state_20m\" where NAME in (\'Oregon\',\'California\')'
twoStates <- st_read(p,query = q1) #Works
How would this work if there was a separate data frame or vector of States that I wanted to include this as a subquery? I've tried (unsuccessfully) various versions of SQL's in
operator but it keeps returning syntax errors:
moreStateNames <- c('Oregon','California','Texas') #Vector of state names
q2 <- 'select * from \"cb_2018_us_state_20m\" where NAME in (\"moreStateNames\")'
moreStates <- st_read(p,query = q2) #Fails
Error: Query execution failed, cannot open layer. In addition: Warning message: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL Error 1: "moreStateNames" not recognised as an available field.
dfStates <- data.frame(id=1:3,moreStateNames=moreStateNames) #Dataframe with state names
q3 <- 'select * from \"cb_2018_us_state_20m\" where NAME in (select moreStateNames from \"dfStates\")'
moreStates <- st_read(p,query = q3) #Fails
Error: Query execution failed, cannot open layer. In addition: Warning message: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL Error 1: SQL Expression Parsing Error: syntax error, unexpected SELECT. Occurred around : m "cb_2018_us_state_20m" where NAME in (select moreStateNames from "dfStates")
Where am I messing up here? What are my options other than reading the entire shapefile and then filtering it post-hoc?
You need to expand the vector moreStateNames
into a comma-separated string in R before submitting the query. Using sprintf
saves you some of the hazzle with quotation marks:
q2 <- sprintf('select * from \"%s\" where NAME in (\'%s\')',
"cb_2018_us_state_20m",
paste(moreStateNames, collapse = "','")
)
results in:
## > q2
## [1] "select * from \"cb_2018_us_state_20m\" where NAME in ('Oregon','California','Texas')"
so that:
## > moreStates <- st_read(p,query = q2) # Works
## Reading query `select * from "cb_2018_us_state_20m" where NAME in ('Oregon','California','Texas')'
## from data source
## `full\path\to\cb_2018_us_state_20m\cb_2018_us_state_20m.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 3 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.5524 ymin: 25.84012 xmax: -93.53094 ymax: 46.26913
## Geodetic CRS: NAD83