I am attempting to build a prediction model. One of my features is an identifier for U.S. states and territories. The original list has 62 unique values, and I was able to reduce those to 5 values using fct_collapse.
dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
'RI', 'SC', 'SD', 'TN', 'TX',
'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
dat$census_region <- fct_collapse(dat$state,
northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
"AR","LA","OK","TX"),
west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
"UNITED STATES MINOR OUTLYING ISLANDS","VI"))
tail(dat,10)
A tibble: 10 x 2
state | census_region |
---|---|
TX | south |
UNITED STATES MINOR OUTLYING ISLANDS | other |
UT | west |
VA | south |
VI | other |
VT | northeast |
WA | west |
WI | midwest |
WV | south |
WY | west |
I am now trying to validate the model, and the smaller dataset does not have all 62 unique state identifiers:
dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK'))
Now, if I attempt to use fct_collapse on the smaller dataset:
dat_2$census_region <- fct_collapse(dat_2$state,
northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
"AR","LA","OK","TX"),
west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
"UNITED STATES MINOR OUTLYING ISLANDS","VI"))
I get this:
Warning message:
Unknown levels in `f`: CT, RI, VT, PA, WI, IA, SD, DE, FL, GA, SC, VA, DC, WV, AL, TN, AR, TX, AZ, CO, UT, WY, AK, CA, HI, OR, WA, AA, AE, AP, AS, FM, GU, PR, UNITED STATES MINOR OUTLYING ISLANDS, VI
I have done something similar by grouping the states and territories by Roman numerals, as defined by the Office of Management and Budget. My goal is to reduce the 62 dummy variables to something more manageable.
THE QUESTION: is there an option within the forcats package (more particularly fct_collapse) that will assign only those values that are found and skip the "Unknown levels"?
You could consider tackling this a different way and just do dat_2 |> left_join(dat), per below.
This grabs the census_region from dat that matches the state in your smaller sample and keeps it as a factor.
library(tidyverse)
dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
'RI', 'SC', 'SD', 'TN', 'TX',
'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
dat$census_region <- fct_collapse(dat$state,
northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
"AR","LA","OK","TX"),
west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
"UNITED STATES MINOR OUTLYING ISLANDS","VI"))
dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'None', 'NV', 'NY', 'OH', 'OK'))
dat_2 |> left_join(dat)
#> Joining, by = "state"
#> # A tibble: 26 × 2
#> state census_region
#> <chr> <fct>
#> 1 ID west
#> 2 IL midwest
#> 3 IN midwest
#> 4 KS midwest
#> 5 KY south
#> 6 LA south
#> 7 MA northeast
#> 8 MD south
#> 9 ME northeast
#> 10 MH other
#> # … with 16 more rows
Created on 2022-05-19 by the reprex package (v2.0.1)
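If you would rather stay with fct_collapse, one option is to declare the full set of state codes as factor levels before collapsing. As far as I can tell, the warning only reports codes in the mapping that are not levels of the factor, and the codes that are present are still collapsed. Giving the factor every code up front means nothing is unknown, so no warning should be raised, and the validation data ends up with the same five regions as the training data. This is a minimal sketch rather than a tested recipe for every forcats version; the names regions and all_codes are my own and not part of forcats.

```r
library(forcats)

# Region mapping copied from the question; `regions` is a name chosen for this sketch.
regions <- list(
  northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
  midwest   = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
  south     = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
                "AR","LA","OK","TX"),
  west      = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
  other     = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
                "UNITED STATES MINOR OUTLYING ISLANDS","VI")
)

# Validation sample from the question.
dat_2 <- data.frame(
  state = c('ID', 'IL', 'IN', 'KS', 'KY',
            'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
            'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
            'None', 'NV', 'NY', 'OH', 'OK'),
  stringsAsFactors = FALSE
)

# Every code that appears in the mapping, used as the complete level set
# (`all_codes` is a hypothetical helper name).
all_codes <- unlist(regions, use.names = FALSE)

# Declare all levels first, then collapse; !!! splices the named list into `...`.
dat_2$census_region <- fct_collapse(
  factor(dat_2$state, levels = all_codes),
  !!!regions
)

# Count of rows per region in the validation sample.
table(dat_2$census_region)
```

The level order here follows the order of all_codes. If the model needs the same level order as the training factor, pass the training levels instead, for example levels = sort(unique(dat$state)).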