rgeospatial

Mismatches resulting from spatial join between two shapefiles


I have two Shape files with different names for the different regions of the world. I need the names of the polygons in countries which correspond to the names of the polygons in countries_GIFT. I have done this with other Shape files and it has been working without problems, but for some reason I am getting mismatches between both data frames.

For example an entry on Russia in the first shapefile is being identified as Estonia in the other shape file. When checking the x y coordinates, I can see that the point is indeed in Russia, but for some reason it is being extracted at something else.

I already checked the projection of both shapefiles and whether the regions overlap each other, so I am not sure what the problem is.

Here is my code

library(sf)
library(tidyverse)
library(GIFT)

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

sf_use_s2(FALSE)

countries <- st_read("../../../../locations/Shapefiles/Locations.shp")
countries <- st_transform(countries, crs = 4326)

GIFT_shapes <- GIFT_shapes() # retrieves all shapefiles by default

countries_GIFT <- GIFT_shapes() 
countries_GIFT_crs <- st_transform(countries_GIFT, crs = 4326)

gift_stand_countries <- st_join(countries_GIFT_crs, countries, join = st_intersects)

# Check if regions overlap
ggplot() +
  geom_sf(data = countries_GIFT_crs, fill = "lightblue", color = "black") +
  geom_sf(data = countries, fill = NA, color = "red")

And here is a plot of both shape files:

enter image description here

And the geometry of both df

head(countries$geometry)
Geometry set for 6 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -178.8733 ymin: -14.54889 xmax: 74.89291 ymax: 71.33334
Geodetic CRS:  WGS 84
First 5 geometries:
MULTIPOLYGON (((71.00423 38.47541, 71.02091 38....
MULTIPOLYGON (((20.97472 59.96736, 20.98694 59....
MULTIPOLYGON (((-178.8139 51.8375, -178.7424 51...
MULTIPOLYGON (((19.59271 41.63155, 19.59014 41....
MULTIPOLYGON (((-1.630694 35.21958, -1.630972 3...

head(countries_GIFT_crs$geometry)
Geometry set for 6 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.6
Geodetic CRS:  WGS 84
First 5 geometries:
MULTIPOLYGON (((-81.81 24.54, -81.77 24.54, -81...
MULTIPOLYGON (((145.99 43.37, 146.01 43.36, 146...
MULTIPOLYGON (((140.87 51.42, 140.88 51.43, 140...
MULTIPOLYGON (((166.4939 -78.21118, 166.6582 -7...
MULTIPOLYGON (((145.99 43.37, 146.01 43.36, 146...

Many thanks in advance


Solution

  • It's difficult to point a finger at something concrete without having access to both sources, but you are most likely dealing with datasets with different resolutions and precisions, latter is also visible from your object's geometries, 71.00423 38.47541 vs -81.81 24.54 . Even with same/similar resolution and precision it would be highly unlikely to get a perfect match for shapes from different data sources, so intersection with neighbours is expected (+1 @VinceGreg's comment).

    Just to illustrate, we can take GIFT_shapes() for Estonia and check which mid-res shapes from GISCO are intersecting, along with a higher-res country outline for reference:

    library(sf)
    #> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
    library(GIFT)
    library(ggplot2)
    
    # 2016; resolution: 1:20million
    countries <- giscoR::gisco_countries[, "CNTR_ID"]
    
    # Estonia
    GIFT_est <- GIFT_shapes(entity_ID = 11456)
    #> You are asking for the latest stable version of GIFT which is 3.2.
    #> ================================================================================
    
    intersecting_countries <- 
      st_join(
        countries,
        GIFT_est,
        left = FALSE
      ) |> 
      # crop with extended bbox for plot
      st_crop(st_bbox(GIFT_est) + c(-.2, -.2, .2, .2))
    #> Warning: attribute variables are assumed to be spatially constant throughout
    #> all geometries
    intersecting_countries[, 1:2]
    #> Simple feature collection with 3 features and 2 fields
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: 21.56417 ymin: 57.31555 xmax: 28.40378 ymax: 60.05231
    #> Geodetic CRS:  WGS 84
    #>     CNTR_ID entity_ID                       geometry
    #> 82       EE     11456 MULTIPOLYGON (((27.83301 59...
    #> 138      LV     11456 MULTIPOLYGON (((27.75134 57...
    #> 185      RU     11456 MULTIPOLYGON (((28.40378 59...
    
    ggplot() +
      geom_sf(data = intersecting_countries, aes(fill = CNTR_ID)) +
      geom_sf(data = GIFT_est, fill = "white", alpha = .6 ) +
      # 1:1million outline for reference
      geom_sf(data = giscoR::gisco_get_countries(resolution = "01", country = "Estonia"), color = "navy", fill = NA) +
      scale_fill_viridis_d() +
      coord_sf(expand = FALSE)
    

    Or you can just check https://gift.uni-goettingen.de/map .


    You might get desired result by first generating st_point_on_surface() points from polygons of one dataset (centroid is not guaranteed to end up in the polygon) and joining those to polygons from another:

    GIFT_est |> 
      st_point_on_surface() |> 
      st_join(giscoR::gisco_countries)
    #> Warning: st_point_on_surface assumes attributes are constant over geometries
    #> Warning in st_point_on_surface.sfc(st_geometry(x)): st_point_on_surface may not
    #> give correct results for longitude/latitude data
    #> Simple feature collection with 1 feature and 17 fields
    #> Geometry type: POINT
    #> Dimension:     XY
    #> Bounding box:  xmin: 25.50592 ymin: 58.62939 xmax: 25.50592 ymax: 58.62939
    #> Geodetic CRS:  WGS 84
    #>   entity_ID geo_entity  point_x  point_y     area    x_min    x_max    y_min
    #> 1     11456    Estonia 25.54518 58.67211 45483.44 21.76361 28.21478 57.51511
    #>      y_max entity_class    entity_type polygon_source CNTR_ID CNTR_NAME
    #> 1 59.82264     Mainland Political Unit           GADM      EE     Eesti
    #>   ISO3_CODE NAME_ENGL FID                  geometry
    #> 1       EST   Estonia  EE POINT (25.50592 58.62939)
    

    With projected CRS I would expect it to work quite well, with geographic coordinates that warning above should probably not be taken too lightly.