I have these 2 shapefiles from the internet:
library(sf)
library(dplyr)
library(fs)
url_1 <- "https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lfsa000b21a_e.zip"
temp_dir <- tempdir()
zip_file <- file.path(temp_dir, "shapefile.zip")
download.file(url_1, destfile = zip_file)
exdir_ <- tempfile(pattern = "shp_1")
unzip(zip_file, exdir = exdir_, junkpaths = TRUE)
fs::dir_tree(exdir_)
file_1 <- st_read(exdir_)
fsa <- st_transform(file_1, "+proj=longlat +datum=WGS84")
unlink(temp_dir, recursive = TRUE)
url_2 <- "https://www12.statcan.gc.ca/census-recensement/alternative_alternatif.cfm?l=eng&dispext=zip&teng=lada000b21a_e.zip&k=%20%20%20151162&loc=//www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lada000b21a_e.zip"
download.file(url_2, destfile = "lada000b21a_e.zip")
unzip("lada000b21a_e.zip")
ada <- st_read("lada000b21a_e.shp")
For the same PRUID, I want to make a matrix which contains what % each polygon from shapefile 1 intersects with each polygon from shapefile 2.
library(sf)
library(dplyr)
calculate_intersection_percentages <- function(ada, fsa, pruid) {
ada_filtered <- ada %>% filter(PRUID == pruid)
fsa_filtered <- fsa %>% filter(PRUID == pruid)
ada_filtered <- st_transform(ada_filtered, st_crs(fsa_filtered))
results <- list()
total_fsas <- nrow(fsa_filtered)
completed_fsas <- 0
for (ada_id in ada_filtered$ADAUID) {
ada_single <- ada_filtered %>% filter(ADAUID == ada_id)
ada_area <- st_area(ada_single)
for (fsa_id in fsa_filtered$CFSAUID) {
fsa_single <- fsa_filtered %>% filter(CFSAUID == fsa_id)
intersection <- st_intersection(ada_single, fsa_single)
if (length(intersection) > 0) {
intersection_area <- st_area(intersection)
percentage <- as.numeric(intersection_area / ada_area * 100)
print(sprintf("ADA: %s, FSA: %s, Intersection = %.2f%%", ada_id, fsa_id, percentage))
results[[ada_id]][[fsa_id]] <- percentage
}
completed_fsas <- completed_fsas + 1
progress_percentage <- (completed_fsas / (total_fsas * nrow(ada_filtered))) * 100
print(sprintf("Progress: %.2f%% of FSA calculations completed", progress_percentage))
}
}
result_matrix <- do.call(rbind, lapply(results, function(x) {
unlist(x, use.names = TRUE)
}))
result_matrix[is.na(result_matrix)] <- 0
result_matrix <- result_matrix / rowSums(result_matrix) * 100
return(result_matrix)
}
I launch the function like this:
result <- calculate_intersection_percentages(ada, fsa, pruid = "10")
The code runs, but I am getting some very exact numbers which makes me think the intersection is not happening correctly:
20.00000000 20.00000000 20.00000000 20.00000000 20.00000000
Also, there are 35 FSA's in Newfoundland (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_A#:~:text=and%20some%20libraries.-,Newfoundland%20and%20Labrador,35%20FSAs%20in%20this%20list.), but I am only seeing 5:
> head(result)
A0A A1B A1C A1E A1S
10010001 15.50990 26.73515 15.50990 26.73515 15.50990
10010002 23.85278 14.22083 23.85278 14.22083 23.85278
10010003 13.88354 29.17468 13.88354 29.17468 13.88354
10010004 20.84652 18.73022 20.84652 18.73022 20.84652
10010005 26.08074 10.87889 26.08074 10.87889 26.08074
10010006 20.00000 20.00000 20.00000 20.00000 20.00000
How can I correctly intersect these two shapefiles?
Just few interesting issues to start with, there might be more.
performance-wise you could do much better without those nested loops, growing that result list shouldn't actually matter too much, but to my understanding this pair-wise operation effectively disables all the gain you could get from spatial indexing. Though (for me) it's not so straightforward to properly test & measure, don't think we can simply disable indexing.
length(intersection) > 0
is probably not indicating the state you had in mind; when using st_intersection()
with your sf
objects, it returns another sf
object, so length()
is never 0, not even for non-intersecting geometries as it's not the number of intersections but number of columns; interestingly this actually pushes you to the right direction as your results list gets NULL
values for 0-area intersections, something you definitely would want when preparing inputs for rbind
; when there's no output from print()
, it's not because your if
-clause but because percentage
is NULL
. When testing with a smaller pre-filtered subsets (ada[1:5,]
, fsa[1:5,]
, both with PRUID == "10"
), final results
looks like this:
lobstr::tree(results)
<list>
├─10010001: <list>
│ ├─A0A: 36.6844068668399
│ ├─A0B: 63.3155931332246
│ ├─A0C<dbl [0]>:
│ ├─A0E<dbl [0]>:
│ └─A0G<dbl [0]>:
├─10010002: <list>
│ ├─A0A: 62.6409032955523
│ ├─A0B<dbl [0]>:
│ ├─A0C<dbl [0]>:
│ ├─A0E<dbl [0]>:
│ └─A0G<dbl [0]>:
├─10010003: <list>
│ ├─A0A: 32.2477921284301
│ ├─A0B: 0
│ ├─A0C<dbl [0]>:
│ ├─A0E<dbl [0]>:
│ └─A0G<dbl [0]>:
├─10010004: <list>
│ ├─A0A<dbl [0]>:
│ ├─A0B<dbl [0]>:
│ ├─A0C<dbl [0]>:
│ ├─A0E<dbl [0]>:
│ └─A0G<dbl [0]>:
└─10010005: <list>
├─A0A: 0
├─A0B<dbl [0]>:
├─A0C<dbl [0]>:
├─A0E<dbl [0]>:
└─A0G<dbl [0]>:
lapply(results, function(x) {unlist(x, use.names = TRUE)})
all NULL
values are dropped, resulting with a ragged list where item alignment for rbind()
is broken; as a result, the number of matrix columns is defined by the number of non-NULL
areas from the first outer-loop iteration, also rows with only NULL
values are dropped; while our small 5-row sf
objects should have resulted with 5x5 matrix, we end up with a 4x2:lapply(results, unlist, use.names = TRUE) |> lobstr::tree(show_attributes = TRUE)
<list>
├─10010001<dbl [2]>: 36.6844068668399, 63.3155931332246
│ └┄attr(,"names")<chr [2]>: "A0A", "A0B"
├─10010002: 62.6409032955523
│ └┄attr(,"names"): "A0A"
├─10010003<dbl [2]>: 32.2477921284301, 0
│ └┄attr(,"names")<chr [2]>: "A0A", "A0B"
├─10010004<dbl [0]>:
├─10010005: 0
┊ └┄attr(,"names"): "A0A"
└┄attr(,"names")<chr [5]>: "10010001", "10010002", "10010003", "10010004", "10010005"
lapply(results, unlist, use.names = TRUE) |> do.call(what = rbind)
A0A A0B
10010001 36.68441 63.31559
10010002 62.64090 62.64090
10010003 32.24779 0.00000
10010005 0.00000 0.00000
To continue with a possible solution, it seems that a frame from st_intersection()
pivoted wider should be quite close to what you are after.
Fetch data and load just a small subsets for testing:
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(dplyr)
library(tidyr)
library(ggplot2)
# curl::curl_download(
# "https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lfsa000b21a_e.zip",
# "lfsa000b21a_e.zip",
# quiet = FALSE)
#
# curl::curl_download(
# "https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lada000b21a_e.zip",
# "lada000b21a_e.zip",
# quiet = FALSE)
# check layers
rbind(
st_layers("/vsizip/lfsa000b21a_e.zip/lfsa000b21a_e/"),
st_layers("/vsizip/lada000b21a_e.zip")
)
#> Driver: ESRI Shapefile
#> Driver: ESRI Shapefile
#> Available layers:
#> layer_name geometry_type features fields crs_name
#> 1 lfsa000b21a_e Polygon 1643 5 NAD83 / Statistics Canada Lambert
#> 2 lada000b21a_e Polygon 5433 4 NAD83 / Statistics Canada Lambert
# read relevant subsets for testing
fsa <- read_sf(
"/vsizip/lfsa000b21a_e.zip/lfsa000b21a_e/",
query = "SELECT CFSAUID, PRUID FROM lfsa000b21a_e WHERE PRUID = '10'",
quiet = TRUE)
nrow(fsa)
#> [1] 35
ada <- read_sf(
"/vsizip/lada000b21a_e.zip",
query = "SELECT ADAUID, PRUID FROM lada000b21a_e WHERE PRUID = '10'",
quiet = TRUE)
nrow(ada)
#> [1] 86
Area ratios of intersections, feel free to adjust scaling:
isect_area_ratio <- function(ada, fsa, pruid) {
fsa_filtered <-
fsa |>
filter(PRUID == pruid) |>
select(CFSAUID)
# add `area_ada` so it would be included in st_intersection() result
ada_filtered <-
ada |>
filter(PRUID == pruid) |>
select(ADAUID) |>
st_transform(st_crs(fsa_filtered)) |>
mutate(area_ada = st_area(geometry))
# intersection also includes geometries without area (points, linestrings, ..),
# leave those out;
# once we have ratios, we can drop geometries, pivot from long to wide and
# convert to matrix
area_ratio_m <-
st_intersection(ada_filtered, fsa_filtered) |>
filter(st_geometry_type(geometry) %in% c("POLYGON", "MULTIPOLYGON", "GEOMETRYCOLLECTION")) |>
mutate(area_ratio = (st_area(geometry) / area_ada) |> as.numeric()) |>
st_drop_geometry() |>
select(ADAUID, CFSAUID, area_ratio) |>
pivot_wider(names_from = CFSAUID, values_from = area_ratio, values_fill = 0) |>
tibble::column_to_rownames("ADAUID") |>
as.matrix()
# make sure matrix row and column order aligns with row order of input dataframes
area_ratio_m[ada_filtered$ADAUID, fsa_filtered$CFSAUID ]
}
Test:
tictoc::tic("isect_area_ratio")
m <- isect_area_ratio(ada, fsa, pruid = "10")
tictoc::toc()
#> isect_area_ratio: 9.14 sec elapsed
# expect 86x35 matrix
str(m)
#> num [1:86, 1:35] 0.367 0.626 0.322 0 0 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : chr [1:86] "10010001" "10010002" "10010003" "10010004" ...
#> ..$ : chr [1:35] "A0A" "A0B" "A0C" "A0E" ...
# upper left corner
m[1:10, 1:10]
#> A0A A0B A0C A0E A0G A0H A0J A0K A0L A0M
#> 10010001 0.3668441 0.6331559 0 0 0 0 0 0 0 0
#> 10010002 0.6264090 0.0000000 0 0 0 0 0 0 0 0
#> 10010003 0.3224779 0.0000000 0 0 0 0 0 0 0 0
#> 10010004 0.0000000 0.0000000 0 0 0 0 0 0 0 0
#> 10010005 0.0000000 0.0000000 0 0 0 0 0 0 0 0
#> 10010006 0.0000000 0.0000000 0 0 0 0 0 0 0 0
#> 10010007 0.0000000 0.0000000 0 0 0 0 0 0 0 0
#> 10010008 0.0000000 0.0000000 0 0 0 0 0 0 0 0
#> 10010009 0.0000000 0.0000000 0 0 0 0 0 0 0 0
#> 10010010 0.0000000 0.0000000 0 0 0 0 0 0 0 0
# check if rows add up to ~1
rowSums(m) |> summary()
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 1 1 1 1 1
A bit closer look at a small subset of intersections, note how we end up with geometry collections and multipolygons, one for every intersecting geometry pair. Number of individual intersection polygons can be much higher.
ada |>
filter(ADAUID == "10010032") |>
mutate(area_ada = st_area(geometry)) |>
st_intersection(fsa) |>
filter(st_geometry_type(geometry) %in% c("POLYGON", "MULTIPOLYGON", "GEOMETRYCOLLECTION")) |>
mutate(area_ratio = (st_area(geometry) / area_ada) |> as.numeric()) |>
print() |>
ggplot() +
geom_sf(data = ada[ada$ADAUID == "10010032","geometry"]) +
geom_sf(aes(fill = CFSAUID), legend = FALSE) +
geom_sf_label(aes(label = scales::label_percent()(area_ratio))) +
facet_grid(rows = vars(ADAUID), cols = vars(CFSAUID)) +
guides(fill = "none")
#> Warning: attribute variables are assumed to be spatially constant throughout
#> all geometries
#> Simple feature collection with 2 features and 6 fields
#> Geometry type: GEOMETRY
#> Dimension: XY
#> Bounding box: xmin: 8952734 ymin: 2027053 xmax: 9015751 ymax: 2119991
#> Projected CRS: NAD83 / Statistics Canada Lambert
#> # A tibble: 2 × 7
#> ADAUID PRUID area_ada CFSAUID PRUID.1 geometry area_ratio
#> * <chr> <chr> [m^2] <chr> <chr> <GEOMETRY [m]> <dbl>
#> 1 10010032 10 3.49e9 A0A 10 MULTIPOLYGON (((8991045 … 0.890
#> 2 10010032 10 3.49e9 A0B 10 GEOMETRYCOLLECTION (POLY… 0.110
Created on 2024-09-23 with reprex v2.1.1