I need to combine small geographic units (census tracts) with their neighbor with the smallest population in an iterative fashion until a population threshold (ex: >=200) is reached and the values for the combined units are summed (added).
The ultimate goal is to run an analysis of spatial autocorrelation but some units have such a small population that they should be combined with a neighbor (such as a merge in GIS that adds some of the values for the two features together). I also have some geographic units that are "islands" so have no neighbors. I'm working in R because I need code that can be re-used for multiple calculations but am also familiar with QGIS and Geoda so for "islands" with no neighbors I can manually alter the neighbors matrix to eliminate islands if needed but it would be better if instead I could write code to search for the nearest neighbor.
See sample data below of census tracts in Richmond County, NY (Staten Island) which has 7 tracts below a threshold of 200. Thanks in advance for your help!
library(tidycensus)
richmondtracts <- get_decennial(
geography = "tract",
state = "NY",
county = "Richmond",
variables = c(
Hispanic = "P2_002N" #Hispanic population per tract
),
summary_var = "P2_001N", #total population per tract
year = 2020,
geometry = TRUE
)
As a fairly new coder, I've been playing around with ChatGPT which has made some suggestions using while loops but they don't run. I'm a bit stuck but found this forum where geographic units are merged with a neighbor based on area: Merging small polygons with largest neighbour in R
TL;DR
It may be better to manually do the merges.
Preamble
The following answer works in limited situations. That’s because, depending on the range and spatial distribution of values for a given ethnicity, there may be an ‘unrealistic’ number of permutations to account for in order to automate your desired outcome. To find a merge match for all tracts < the threshold in all situations requires an algorithm more complex than the one outlined below. This stands even if restricting merges to the neighbouring tract with the lowest value.
I suggest examining the values of all ethnic groups at the same time to see if there are below threshold tracts that could be permanently merged. If that is not possible, a suggested workflow would be to first plot the tracts < the threshold and defining merge matches by ‘eyeballing’ the data and assigning merge ids manually. Then, pivot the data based on merge ids and use summarise()
to merge the tracts. In essence, this is the foundation of the following answer.
This method has only been tested on the outlined example data and is likely to fail in certain situations not limited to:
This method considers tracts with value == 0 as needing to be merged. Doing so tends to exacerbate the ecologically fallacious nature of census tract data. You will need to check how to manage this regarding your chosen analysis methods e.g. whether is it ok to include tracts with value == 0.
For simplicity, neighbours that only share a single node/vertex are ignored. Also, there are no non-contiguous tracts in your example data so this answer includes a version of richmondtracts with two tracts identified as islands.
The criteria of lowest value neighbour may not be the best approach given Tobler’s First Law. For instance, in the modified example data, the polygon with oid == 33 is matched with 32 and 29. Because 29 is further away than 55, it would arguably be more methodically sound to match 33 based on proximity.
Some steps will return warnings. They can be safely ignored for the purposes of this answer. The “although coordinates are longitude/latitude, st_intersection assumes that they are planar” warning is telling you st_intersection()
processes data as if were PCS and your data are defined by a GCS. For this solution, the returned results are fine. The “repeating attributes for all sub-geometries for which they may not be constant” warning is telling you that variables such as value and summary_value get repeated if for instance a MULTIPOLYGON gets split into two or more POLYGON features. In other words, do not sum value and/or summary_value if you join them back together.
Workflow
summarise()
by neighbour oid, and calculate border length. Create columns to determine whether a neighbour can be considered for merging:filter()
rows: if they do not sum to >= threshold only if there are other candidates; if duplicate matches do not have the longest border; and then return the match with minimum value only if a neighbour match existsarrange()
so slice_head()
returns lowest available value matchsummarise()
to merge matched tractslibrary(tidycensus)
library(sf)
library(dplyr)
library(tidyr)
library(lwgeom)
library(ggplot2)
# Original data
richmondtracts <- get_decennial(
geography = "tract",
state = "NY",
county = "Richmond",
variables = c(
Hispanic = "P2_002N" #Hispanic population per tract
),
summary_var = "P2_001N", #total population per tract
year = 2020,
geometry = TRUE) %>%
filter(!st_is_empty(.)) %>% # Remove empty geometries
mutate(oid = 1:n()) # Add rownames as integer, oid is basis for defining merges
### Modifications for illustrative purposes, skip to step 1 if using original data
# Modify richmondtracts to demonstrate permutations of islands and no neighbours > threshold
richmondtracts <- richmondtracts %>%
st_cast(., "POLYGON") %>%
mutate(oid = 1:n())
richmondtracts[95, c(1,2,4,5)] <- list("30000000000",
"Census Tract 000, Richmond County, New York",
1, 5)
richmondtracts[c(9,32,60),"value"] <- c(1,2,3)
###
# 1. Generate geometries for determining length of neighbouring polygon borders (for tie break)
# Generate POINT version of polygon data
p <- richmondtracts %>%
st_cast(., "POLYGON") %>%
st_cast(., "POINT") %>%
st_intersection()
# Generate LINESTRING version of polygon data
l <- richmondtracts %>%
st_cast(., "POLYGON") %>%
st_cast(., "LINESTRING")
# Split LINESTRING version to individual lines
spl <- st_split(l, p) %>%
st_collection_extract(., "LINESTRING") %>%
mutate(row.id = 1:n())
# 2. Merge islands to nearest nonisland
# Get oid of all islands
islands <- data.frame(st_intersects(richmondtracts)) %>%
group_by(col.id) %>%
filter(n() == 1) %>%
ungroup() %>%
### NOTE: Only for modified example data, skip this line if using original data
add_row(row.id = 35, col.id = 35)
if(nrow(islands) == 0) {
df <- richmondtracts
print("No non-contiguous tracts found, skip to step 6")
print("Optionally run step 4a for a pre-merge comparison plot")
} else {
print("Non-contiguous tracts found, continue on from here")
}
# If islands exist
islands <- richmondtracts %>%
filter(oid %in% islands[["row.id"]])
# Get all non-islands
nonisland <- richmondtracts %>%
filter(!oid %in% islands[["oid"]])
# 3. Find polygons in nonislands nearest to each island and assign each 'match' a unique merge_id
islands <- islands %>%
mutate(nearest = st_nearest_feature(islands, nonisland)) %>%
data.frame() %>%
group_by(oid) %>%
mutate(merge_id = cur_group_id() + nrow(richmondtracts)) %>%
ungroup() %>%
select(oid, nearest, merge_id) %>%
pivot_longer(-merge_id,
values_to = "oid") %>%
select(oid, merge_id)
# 4. Create new version of richmondtracts with islands merged to nearest nonislands
df <- richmondtracts %>%
left_join(., islands, by = "oid") %>%
mutate(merge_id = ifelse(is.na(merge_id), oid, merge_id)) %>%
group_by(merge_id) %>%
summarise(GEOID = first(GEOID),
oid = first(oid),
value = sum(value),
summary_value = sum(summary_value),
.groups = "drop") %>%
ungroup()
# 4a. Return polygons with values below threshold (for plot)
polysub <- richmondtracts %>%
filter(as.numeric(unlist(value)) < 200) %>%
mutate(row.id = as.integer(rownames(.)))
ggplot() +
geom_sf(data = richmondtracts,
colour = "blue",
fill = "grey90",
lwd = .5,
linetype = "dotted") +
geom_sf(data = polysub,
fill = "red") +
geom_sf(data = df,
color = "grey70",
fill = NA,
lwd = .5) +
geom_sf_text(data = filter(df, !oid %in% polysub[["oid"]]),
aes(label = oid),
size = 1,
fun.geometry = st_centroid,
show.legend = FALSE) +
geom_sf_text(data = polysub,
aes(label = oid),
size = 2,
fun.geometry = st_centroid,
show.legend = FALSE) +
theme(panel.background = element_rect(fill = "#e3eeff", colour = "#e3eeff"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.margin=grid::unit(c(0,0,0,0), "mm"))
# 5. Merge tracts to ensure no polygon value < given threshold (e.g. 200)
# Regenerate geometries for determining length of polygon borders based on updated tracts.
p <- df %>%
st_cast(., "POLYGON") %>%
st_cast(., "POINT") %>%
st_intersection()
l <- df %>%
st_cast(., "POLYGON") %>%
st_cast(., "LINESTRING")
spl <- st_split(l, p) %>%
st_collection_extract(., "LINESTRING") %>%
mutate(row.id = 1:n())
# 6. Return polygons with values below threshold
polysub <- df %>%
filter(as.numeric(unlist(value)) < 200) %>%
mutate(row.id = as.integer(rownames(.)))
# Return LINESTRING version of polysub
linesub <- spl %>%
filter(.[["oid"]] %in% polysub[["oid"]]) %>%
mutate(row.id = 1:n())
# 7. Create new version of df with polygons < threshold merged to neighbour with lowest value
df1 <- linesub %>%
st_equals(., spl) %>%
data.frame() %>%
left_join(., select(data.frame(linesub), oid, row.id, value, summary_value),
by = join_by(row.id)) %>%
left_join(., spl, by = join_by(col.id == row.id),
relationship = "many-to-many") %>%
select(-col.id) %>%
filter(!oid.x == oid.y) %>%
st_as_sf() %>%
group_by(oid.x, oid.y) %>%
summarise(value = min(value.x),
value1 = min(value.y),
summary_value = min(summary_value.x),
summary_value1 = min(summary_value.y),
.groups = "drop") %>%
ungroup() %>%
mutate(length_m = st_length(.)) %>%
filter(!oid.x == oid.y) %>%
group_by(oid.y) %>%
mutate(threshold = ifelse((value + value1) >= 200, 1, 0),
dupe_match = n(),
dupe_rank = rank(length_m)) %>%
group_by(oid.x) %>%
mutate(candidates = any(sum(threshold) > 0)) %>%
filter(!threshold == 0 & candidates == TRUE | candidates == FALSE) %>%
filter(!dupe_rank < dupe_match & candidates == TRUE | candidates == FALSE) %>%
filter(value1 == min(value1) | candidates == FALSE) %>%
ungroup() %>%
mutate(value1 = ifelse(candidates == FALSE,
value1[match(oid.y, oid.x)],
value1),
oid.y = ifelse(candidates == FALSE,
oid.y[match(oid.y, oid.x)],
oid.y)) %>%
group_by(oid.x) %>%
arrange(oid.x, value1) %>%
slice_head() %>%
mutate(value1 = ifelse(candidates == FALSE, 0, value1)) %>%
data.frame() %>%
group_by(oid.y) %>%
mutate(merge_id = cur_group_id() + nrow(richmondtracts)) %>%
ungroup() %>%
select(oid.x, oid.y, merge_id) %>%
pivot_longer(-merge_id,
values_to = "oid") %>%
select(oid, merge_id) %>%
distinct() %>%
left_join(if("merge_id" %in% names(df)) { select(df, -merge_id) } else { df }, .,
by = "oid") %>%
mutate(merge_id = ifelse(is.na(merge_id), oid, merge_id)) %>%
group_by(merge_id) %>%
summarise(GEOID = first(GEOID),
oid = first(oid),
value = sum(value),
summary_value = sum(summary_value),
.groups = "drop") %>%
ungroup()
ggplot() +
geom_sf(data = df1,
color = NA,
fill = "grey90",
lwd = .5) +
geom_sf(data = filter(df1, merge_id > nrow(richmondtracts)),
aes(fill = factor(oid)),
lwd = .5,
show.legend = FALSE) +
geom_sf(data = df1,
color = "grey70",
fill = NA,
lwd = .5) +
geom_sf(data = polysub,
color = "blue",
fill = NA,
lwd = .75,
linetype = "dotted") +
geom_sf(data = richmondtracts,
color = "grey40",
fill = NA,
lwd = .4,
linetype = "dotted") +
geom_sf_text(data = filter(df1, merge_id > nrow(richmondtracts)),
aes(label = oid),
size = 2,
fun.geometry = st_centroid,
show.legend = FALSE) +
geom_sf_text(data = df,
aes(label = oid),
size = 1,
fun.geometry = st_centroid,
show.legend = FALSE) +
theme(panel.background = element_rect(fill = "#e3eeff", colour = "#e3eeff"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.margin=grid::unit(c(0,0,0,0), "mm"))
Modified example data result pre and post merge:
Original data result pre and post merge: