How might I capture all the years of age in a column with values like "20 to 24 years" for one group and "22 to 24 years" for another group? This will enable me to confirm I have all the working age (18-64) variable names captured in a tidycensus (R package) U.S. Census API query.
What I want is, for ages 20-24 in this example, a data frame that extracts the ages from label entries like "22 to 24 years"
:
MEN WOMEN ETHNORACE
18 18 BLACK
19 19 BLACK
20 20 BLACK
21 21 BLACK
22 22 BLACK
23 23 BLACK
24 BLACK
I can then easily create a data frame that has all the ages and compare to see if any are missing.
One can see at https://api.census.gov/data/2019/acs/acs5/variables.html that at least the American Community Survey (ACS) by the U.S. Census has age range fields with varying syntax (e.g. "20 years" and "22 to 24 years"):
tidycensus R package version 1.1
## Example rows from tidycensus using:
library(tidycensus)
library(magrittr)
library(stringr)
v19 <- load_variables(2019, "acs5", cache = TRUE)
v19 %>%
dplyr::filter(
str_detect(label, "18|20|24") &
concept %in% c("SEX BY AGE",
"SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
grepl('FEMALE', toupper(label))
)
v19_Total_AndBlack_Age18_24 <-
v19 %>% dplyr::filter(
str_detect(label, "18|20|24") &
concept %in% c("SEX BY AGE",
"SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
grepl('FEMALE', toupper(label)))
print(v19_Total_AndBlack_Age18_24)
name label concept
<chr> <chr> <chr>
1 B01001_031 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE
2 B01001_032 Estimate!!Total:!!Female:!!20 years SEX BY AGE
3 B01001_034 Estimate!!Total:!!Female:!!22 to 24 years SEX BY AGE
4 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
5 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...
In this example, I want to make sure every age from 18-24 for the Total and Black populations is present in a dataframe like the following - notice the use of the Census API names from the above's v19_Total_AndBlack_Age18_24
.
v19_Total_AndBlack_Age18_24 <-
get_acs(
year = 2019,
geography = "zcta",
variables = c(v19_Total_AndBlack_Age18_24$name)
)
Notice that Total "22 to 24 years" compares to Black "20 to 24 years".
Let's focus on dataframe v19_Total_AndBlack_Age18_24
above, which lists out the Census API names and labels for ages 18 - 24, and aim to confirm all years are present.
I can get all of the numbers in the ages with a regular expression via:
unlist(str_extract_all(v19_Total_AndBlack_Age18_24$label,"\\d{2}"))
[1] "18" "19" "20" "22" "24" "18" "19" "20" "24"
But my attempts to group by the category are failing, and I still need to get a vector that spans the age ranges when the word "to" appears as in "20 to 24".
v19_Total_AndBlack_Age18_24_grp <-
v19_Total_AndBlack_Age18_24 %>%
mutate(EthnoRace = case_when(
grepl('BLACK', concept) ~ "BLACK",
TRUE ~ "TOTAL"))
v19_Total_AndBlack_Age18_24_grp %>%
group_by(EthnoRace) %>%
mutate(ages = str_extract_all(label, "\\d{2"))
Error: Problem with `mutate()` column `ages`.
i `ages = str_extract_all(label, "\\d{2")`.
x Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`\d{2`)
i The error occurred in group 1: Group = "TOTAL".
First, for what ages, genders, and ethnoracial groups does one want data? This could be modified to only choose one gender. Gender_var
needs to appear in the label column at https://api.census.gov/data/2019/acs/acs5/variables.html (or can use load_variables()
like below when dataframe v19
is created).
Only have to set these for code to run.
min_age_desired <- 18
max_age_desired <- 24
Gender_var = c("MALE", "FEMALE")
EthnoRace_var = c("BLACK","TOTAL")
Now, let's create a QC data frame with all the ethnoracial groups and age groups we need.
library(arsenal)
library(dplyr)
library(stringr)
library(tidyr)
library(tidycensus)
options(scipen = 8)
has all the ages and ethnoracial groups one wants from the Census API
AGE_var = as.numeric(seq(min_age_desired, max_age_desired, 1))
all_grp_qc_frm <-
data.frame(
## dupli
expand.grid(EthnoRace_var,
Gender_var,
AGE_var
)
)
colnames(all_grp_qc_frm) <-
c("EthnoRace", "Gender", "AGE")
all_grp_qc_frm$AGE <- as.numeric(
all_grp_qc_frm$AGE)
all_grp_qc_frm$EthnoRace <- as.character(
all_grp_qc_frm$EthnoRace)
all_grp_qc_frm$Gender <- as.character(
all_grp_qc_frm$Gender)
all_grp_qc_frm <- all_grp_qc_frm %>%
arrange(EthnoRace,Gender,AGE)
print(all_grp_qc_frm)
EthnoRace Gender AGE
1 BLACK MALE 18
2 BLACK MALE 19
3 BLACK MALE 20
4 BLACK MALE 21
5 BLACK MALE 22
6 BLACK MALE 23
7 BLACK MALE 24
8 BLACK FEMALE 18
9 BLACK FEMALE 19
10 BLACK FEMALE 20
11 BLACK FEMALE 21
12 BLACK FEMALE 22
13 BLACK FEMALE 23
14 BLACK FEMALE 24
15 TOTAL MALE 18
16 TOTAL MALE 19
17 TOTAL MALE 20
18 TOTAL MALE 21
19 TOTAL MALE 22
20 TOTAL MALE 23
21 TOTAL MALE 24
22 TOTAL FEMALE 18
23 TOTAL FEMALE 19
24 TOTAL FEMALE 20
25 TOTAL FEMALE 21
26 TOTAL FEMALE 22
27 TOTAL FEMALE 23
28 TOTAL FEMALE 24
This is for the 2019 American Community Survey 5-year estimates
v19 <- load_variables(2019, "acs5", cache = TRUE)
There are many variables available through the Census API.
To subset, let's first get a vector with each age from 18 to 24 separated by a pipe.
working_age_vec <- paste0(seq(18,24,1), collapse = "|")
Notice I need to use the correct concept values to get Black and the Total population across ethnoracial groups.
v19_Total_And_EthnoRace_Age18_24 <-
v19 %>% dplyr::filter(
str_detect(label, working_age_vec) &
concept %in% c("SEX BY AGE",
"SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
grepl('FEMALE|MALE', toupper(label)))
print(v19_Total_And_EthnoRace_Age18_24)
# A tibble: 12 x 3
name label concept
<chr> <chr> <chr>
1 B01001_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE
2 B01001_008 Estimate!!Total:!!Male:!!20 years SEX BY AGE
3 B01001_009 Estimate!!Total:!!Male:!!21 years SEX BY AGE
4 B01001_010 Estimate!!Total:!!Male:!!22 to 24 years SEX BY AGE
5 B01001_031 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE
6 B01001_032 Estimate!!Total:!!Female:!!20 years SEX BY AGE
7 B01001_033 Estimate!!Total:!!Female:!!21 years SEX BY AGE
8 B01001_034 Estimate!!Total:!!Female:!!22 to 24 years SEX BY AGE
9 B01001B_007 Estimate!!Total:!!Male:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
10 B01001B_008 Estimate!!Total:!!Male:!!20 to 24 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
11 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
12 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
Census_Total_AndBlack_Age18_24 <-
get_acs(
year = 2019,
geography = "zcta",
variables = c(v19_Total_AndBlack_Age18_24$name)
)
Get concept values like SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
and label values like Estimate!!Total:!!Male:!!10 to 14 years
.
Census_Total_AndBlack_Age18_24 <- left_join(
Census_Total_AndBlack_Age18_24,
v19 %>%
select(name, concept, label) %>%
rename(variable = name)
)
Regular expressions to extract ages and create a vector expressing the sequence from the lowest to highest age per range.
Census_Total_AndBlack_Age18_24_grp <-
Census_Total_AndBlack_Age18_24 %>%
distinct(label, concept) %>%
## regular expression to extract all the numbers in labels like
## Estimate!!Total:!!Male:!!5 to 9 years
mutate(ages = sapply(str_extract_all(label,"\\d{2}"),
function(x) paste(x,collapse=""))) %>%
mutate(start = str_sub(ages, 1, 2),
end = str_sub(ages, 3, 4)) %>%
mutate(
start = case_when(
is.na(start) ~ "99",
TRUE ~ start),
end = case_when(
is.na(end) ~ "99",
TRUE ~ end)) %>%
dplyr::filter(grepl('Female|Male', label)) %>%
mutate(Gender = case_when(
grepl('Female', label) ~ "FEMALE",
grepl('Male', label) ~ "MALE",
TRUE ~ "MISSING")) %>%
mutate(EthnoRace = case_when(
grepl('BLACK', concept) ~ "BLACK",
TRUE ~ "TOTAL")) %>%
mutate(end = case_when(
is.na(end) | end == "" ~ start,
TRUE ~ end))
Census_Total_AndBlack_Age18_24_grp_sum <- Census_Total_AndBlack_Age18_24_grp %>%
group_by(EthnoRace, Gender) %>%
summarize(AGE = as.numeric(unlist(purrr::map2(start, end, `:`)))) %>%
ungroup() %>%
distinct(EthnoRace, Gender, AGE)
Remember that the desired ethnoracial groups, gender groups, and ages are printed at the top of this solution.
arsenal::comparedf(all_grp_qc_frm, Census_Total_AndBlack_Age18_24_grp_sum)
[...]
Not shared: 0 variables and 0 observations.
Differences found in 0/3 variables compared.
0 variables compared have non-identical attributes.