I’m getting bizarre results when I intersect some point data with polygons using the sf
package in R, and while I think I’ve found a solution, I’m so baffled by the issue that I’d really appreciate it if someone could help me understand what’s going on and hence whether I’m missing any potential issues.
I have one dataset with just coordinates; I need to know what 2000 tract each point is in, so I’d like to intersect it with a 2000 tract shapefile in order to assign the correct tract IDs. My real data are confidential, so for an example I’m currently using the DC metro map, available here. My real data are just coordinates, not a shapefile, though, and so lack a projection initially. I'll remove the projection from the metro map to imitate my actual data (and duplicate the data so I can show my solution):
library(sf)
# open metro shapefile using file paths not shown here
metro <- read_sf(dsn = metro.fp, layer = metro.filename)
dim(metro)
[1] 98 15
# remove metro projection to mirror initial state of real data
st_crs(metro) <- ""
# duplicate this data for later example of my solution
metro2 <- metro
I’m using a shapefile with polygons of 2000 US Census tracts (in 2010 TIGER/Line geometry) downloaded from NHGIS.org. (If there's a way to share an example of this more easily, please let me know; I could use dput()
, but even combined with head()
, the output is enormous.) It’s using an ESRI projection, which I think is relevant to the issue at hand. I won't show the full output when I look at its projection because it's very long, but when you use st_crs()
it shows that this map is using USA_Contiguous_Albers_Equal_Area_Conic
with ID["ESRI",102003]
at the bottom.
When I convert my point dataset to the same projection as the polygons -- using st_crs()
instead of st_transform()
because my point data lack a projection to begin with -- and then intersect the two, I end up with this odd issue where all of my points are supposedly in Kansas (State fips of 20), even though these are all Metro stations throughout DC, Maryland, and Virginia:
# open tract map using file paths not shown here
tract00 <- read_sf(dsn = fp, layer = file.name)
# give metro same projection as tract map
st_crs(metro) <- st_crs(tract00)
# intersect the two
metro.tract <- st_intersection(x = metro,
y = tract00)
# now all states are kansas (state fips of 20):
table(metro.tract$STATEFP00)
20
98
If instead I give my initial metro dataset a random projection and then use st_transform
to put it in the same projection as the tract shapefile, then I correctly get that the stations exist in DC, Maryland, and Virginia (DC is 11, MD is 24, and VA is 51):
# now try again, first giving metro2 a random projection so I can use st_transform
st_crs(metro2) <- "WGS84"
# give metro2 same projection as tract map
metro2 <- st_transform(metro2, crs = st_crs(tract00))
# intersect the two
metro.tract2 <- st_intersection(x = metro2,
y = tract00)
# now the states are correct:
table(metro.tract2$STATEFP00)
11 24 51
40 26 32
Does anyone have any idea what’s going on? While my solution here generates results that look correct, because I don’t understand what the issue is exactly, I worry that I could be missing other potential problems with my real data.
My best guess is that this has something to do with the ESRI projection of the tract shapefile, since the sf documentation of st_transform says, “Extra care is needed with the ESRI Shapefile format, because WKT1 does not store axis order unambiguously.” But this is basically speculation on my part, and I’m not sure why my solution would work in this case.
Basically if anyone can tell me what’s going on or whether I’m missing any other potential issues, I’d appreciate it.
I will address your points in turn to clarify this issue. First, load required packages and create some representative data to use:
library(tigris) # For US census tracts
library(sf)
library(dplyr)
# Read tract data for DC. Project to ESRI:102003 as default CRS is NAD83:EPSG4269
tract00 <- tracts(state = "dc", year = 2000) %>%
st_transform("ESRI:102003")
tract00$geometry
# Geometry set for 188 features
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: 1610777 ymin: 306994.4 xmax: 1629412 ymax: 329361
# Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
# First 5 geometries:
# MULTIPOLYGON (((1617395 319019.9, 1617392 31902...
# MULTIPOLYGON (((1617767 319815.2, 1617772 31981...
# MULTIPOLYGON (((1616807 318539.2, 1616794 31854...
# MULTIPOLYGON (((1617619 318298.1, 1617601 31829...
# MULTIPOLYGON (((1618010 318832.6, 1618021 31883...
# DC metro data shp previously unzipped to working directory, downloaded from
# https://opendata.dc.gov/datasets/DCGIS::metro-stations-regional/explore
metro <- st_read("Metro_Stations_Regional.shp")
metro$geometry
# Geometry set for 98 features
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: -77.49154 ymin: 38.76653 xmax: -76.84455 ymax: 39.11994
# Geodetic CRS: WGS 84
# First 5 geometries:
# POINT (-76.91147 38.82645)
# POINT (-77.05367 38.81415)
# POINT (-77.06081 38.80659)
# POINT (-77.07088 38.80043)
# POINT (-77.07521 38.79392)
Note the difference in the values between metro and tract00.
When I convert my point dataset to the same projection as the polygons ... I end up with this odd issue where all of my points are supposedly in Kansas...
The coordinate units for the DC metro dataset you linked to are decimal degrees, whereas the coordinate units for the tract data are metres. By running:
st_crs(metro) <- st_crs(tract00)
# Warning message:
# st_crs<- : replacing crs does not reproject data; use st_transform for that
you are defining those decimal degree values as metres. That's precisely why you get a warning if you do so. The result is your metro data get located > 1.6e6m away from where they should be longitudinally, and > 3.06e5m away latitudinally e.g. Dorothy and Toto are back in Kansas ;)
If instead I give my initial metro dataset a random projection and then use
st_transform
to put it in the same projection as the tract shapefile, then I correctly get that the stations exist in DC...
It turns out in this hypothetical, you got 'lucky' by assigning WGS84:EPSG4326 to the metro2 data. That's because we know that original metro file has WGS84 as its CRS. So when you ran:
st_crs(metro2) <- "WGS84"
metro2 <- st_transform(metro2, crs = st_crs(tract00))
the st_transform()
function went "I see metro is in decimal degrees and I know where it's located based on the CRS, so I will transform those decimal degree coordinate values to metres and locate them using the CRS of tract00."
However, as I stated, you got 'lucky'. In the case of your actual dataset, you have not indicated which CRS forms the basis of the coordinate values. This is an issue because geographic coordinate system (GCS) values look similar, but can vary greatly. Consider the following:
# Reload metro dataset
metro <- st_read("Metro_Stations_Regional.shp")
# Convert metro to df with 'unknown' coordinates to replicate your actual data
metro2 <- data.frame(id = 1:nrow(metro),
st_coordinates(metro))
head(metro2)
# id X Y
# 1 1 -76.91147 38.82645
# 2 2 -77.05367 38.81415
# 3 3 -77.06081 38.80659
# 4 4 -77.07088 38.80043
# 5 5 -77.07521 38.79392
# 6 6 -76.99537 38.86297
# Coordinates incorrectly assumed to be NAD27/EPSG:4267
p_4267 <- st_as_sf(metro2, coords = c("X", "Y"), crs = 4326) %>%
st_transform(4267) %>%
st_set_crs(4326)
# Warning message:
# st_crs<- : replacing crs does not reproject data; use st_transform for that
p_4267$geometry
# Geometry set for 98 features
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: -77.49185 ymin: 38.76649 xmax: -76.84488 ymax: 39.11991
# Geodetic CRS: WGS 84
# First 5 geometries:
# POINT (-76.91179 38.82642)
# POINT (-77.05399 38.81412)
# POINT (-77.06114 38.80656)
# POINT (-77.0712 38.8004)
# POINT (-77.07553 38.79389)
# Return distance between original and incorrectly projected location
head(st_distance(metro, p_4267 , by_element = TRUE))
# Units: [m]
# [1] 28.61364 28.23506 28.21754 28.19191 28.18171 28.38108
Even though metro has the same CRS as p_4267, and the coordinate values look similar, they are in fact ~28m apart. In summary, you need to know which CRS formed the basis of your actual data prior to projecting.