Tags: r, r-sf, tidycensus, tigris, tiger-census

Why is the area provided by tidycensus different from the area calculated by sf::st_area()?


I am using the tidycensus R package to pull in census data and geometries. I want to calculate population densities and have the results match what I see on censusreporter.org. I am noticing a difference between the area variable returned by tidycensus and the area I calculate myself with the sf package's sf::st_area() function.

library(tidyverse)
library(tidycensus)
census_api_key("my_api_key")
library(sf)
options(tigris_use_cache = TRUE)

pop_texas <-
  get_acs(geography = 'state',
          variables = "B01003_001", # Total Population
          year = 2020,
          survey = 'acs5',
          keep_geo_vars = TRUE,
          geometry = TRUE) %>%
  filter(GEOID == '48') # Filter to Texas

Because I set keep_geo_vars = TRUE, the result includes an ALAND column, which I believe is the correct area of the returned geography in square meters (m^2).

> pop_texas$ALAND %>% format(big.mark=",")
[1] "676,680,588,914"

# Conversion to square miles
> (pop_texas$ALAND / 1000000 / 2.5899881) %>% format(big.mark=",")
[1] "261,267.8"

When I convert the ALAND amount to square miles I get the same number as shown on censusreporter.org.

I have also tried to calculate the area using the sf::st_area() function, but I get a different result:

> sf::st_area(pop_texas) %>% format(big.mark=",", scientific=FALSE)
[1] "688,276,954,146 [m^2]"

# Conversion to square miles
> (sf::st_area(pop_texas) / 1000000 / 2.5899881) %>%
+   as.numeric() %>%
+   format(big.mark=",", scientific=FALSE)
[1] "265,745.2"

Please let me know if there is something I am missing to reconcile these numbers. I would expect to get the same results either directly through tidycensus or calculating the area using sf::st_area().

Right now I am off by a lot:

> (pop_texas$ALAND - as.numeric(st_area(pop_texas)) ) %>%
+   format(big.mark=",")
[1] "-11,596,365,232"

Solution

  • If you want the "official" area of a shape like Texas, use ALAND or another published area value. st_area() computes the area from the polygon geometry, which is always a simplified and imperfect representation of Texas (or any other area). Note also that ALAND counts land only (water area is reported separately in AWATER), so a polygon that also covers inland water will measure larger than ALAND. For smaller shapes (like Census tracts) the two values will probably be close; for larger shapes like states (especially those with complex coastal geography, like Texas) the gap will be larger.
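
As a minimal sketch of that advice, the density calculation from the question could be based on ALAND rather than on st_area(). The helper name density_per_sq_mile and the constant SQ_M_PER_SQ_MI are hypothetical names introduced here for illustration; the conversion factor and the Texas ALAND value are the ones from the question, and estimate is assumed to be the population column returned by get_acs().

```r
# Square meters per square mile, using the same factor as the question
SQ_M_PER_SQ_MI <- 1e6 * 2.5899881

# Hypothetical helper: people per square mile of land area.
# `population` is a count; `aland_m2` is ALAND in square meters.
density_per_sq_mile <- function(population, aland_m2) {
  population / (aland_m2 / SQ_M_PER_SQ_MI)
}

# ALAND for Texas as reported in the question, converted to square miles
aland_texas <- 676680588914
texas_sq_mi <- aland_texas / SQ_M_PER_SQ_MI
format(texas_sq_mi, big.mark = ",")

# With the pop_texas object from the question, this would look like:
# pop_texas %>% mutate(density = density_per_sq_mile(estimate, ALAND))
```

Keeping the conversion in one small function means the published land area is used consistently, and the polygon geometry is only needed for mapping.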