r forcats

Using fct_collapse on a subset of data


I am attempting to build a prediction model. One of my features is an identifier for U.S. states and territories. The original list has 62 unique values, and I was able to reduce those to 5 values using fct_collapse.

library(tidyverse)  # provides tibble() and forcats::fct_collapse()

dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
                        'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
                        'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
                        'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                        'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                        'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
                        'RI', 'SC', 'SD', 'TN', 'TX', 
                        'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
                        'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
dat$census_region <- fct_collapse(dat$state,
    northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
    midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
    south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
         "AR","LA","OK","TX"),
    west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
    other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
         "UNITED STATES MINOR OUTLYING ISLANDS","VI"))

tail(dat,10)

A tibble: 10 x 2

state census_region
TX south
UNITED STATES MINOR OUTLYING ISLANDS other
UT west
VA south
VI other
VT northeast
WA west
WI midwest
WV south
WY west

I am now trying to validate the model, and the smaller dataset does not have all 62 unique state identifiers:

dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
                          'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                          'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                          'None', 'NV', 'NY', 'OH', 'OK'))

Now, if I attempt to use fct_collapse on the smaller dataset:

dat_2$census_region <- fct_collapse(dat_2$state,
    northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
    midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
    south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
        "AR","LA","OK","TX"),
    west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
    other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
        "UNITED STATES MINOR OUTLYING ISLANDS","VI"))

I get this:

Warning message: Unknown levels in f: CT, RI, VT, PA, WI, IA, SD, DE, FL, GA, SC, VA, DC, WV, AL, TN, AR, TX, AZ, CO, UT, WY, AK, CA, HI, OR, WA, AA, AE, AP, AS, FM, GU, PR, UNITED STATES MINOR OUTLYING ISLANDS, VI

I have done something similar by grouping the states and territories by Roman numerals, as defined by the Office of Management and Budget. My goal is to reduce the 62 dummy variables to something more manageable.

THE QUESTION: Is there an option within the forcats package (more particularly fct_collapse) that will assign only those values that are found and skip the "Unknown levels"?


Solution

  • You could tackle this a different way and just do dat_2 |> left_join(dat), as shown below.

    This pulls the census_region for each matching state from dat into your smaller sample and keeps it as a factor, with the same levels as in dat.

    library(tidyverse)
    
    dat <- tibble(state = c('AA', 'AE', 'AK', 'AL', 'AP', 'AR', 'AS', 'AZ',
                            'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
                            'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY',
                            'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                            'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                            'None', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR',
                            'RI', 'SC', 'SD', 'TN', 'TX', 
                            'UNITED STATES MINOR OUTLYING ISLANDS', 'UT',
                            'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY'))
    
    dat$census_region <- fct_collapse(dat$state,
                                      northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
                                      midwest = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
                                      south = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
                                                "AR","LA","OK","TX"),
                                      west = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
                                      other = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
                                                "UNITED STATES MINOR OUTLYING ISLANDS","VI"))
    
    dat_2 <- tibble(state = c('ID', 'IL', 'IN', 'KS', 'KY',
                              'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO', 
                              'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                              'None', 'NV', 'NY', 'OH', 'OK'))
    
    dat_2 |> left_join(dat)
    #> Joining, by = "state"
    #> # A tibble: 26 × 2
    #>    state census_region
    #>    <chr> <fct>        
    #>  1 ID    west         
    #>  2 IL    midwest      
    #>  3 IN    midwest      
    #>  4 KS    midwest      
    #>  5 KY    south        
    #>  6 LA    south        
    #>  7 MA    northeast    
    #>  8 MD    south        
    #>  9 ME    northeast    
    #> 10 MH    other        
    #> # … with 16 more rows
    

    Created on 2022-05-19 by the reprex package (v2.0.1)
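
On the warning itself: as far as I can tell, it is only informational, since fct_collapse still recodes the levels that are present, and I am not aware of an argument that switches it off. If you would rather not depend on dat being around for the join, another option is a plain lookup vector. The following is a minimal sketch in base R, not a forcats feature; the names regions, lookup and to_region are made up for illustration. It also fixes the level set, so the training and validation data end up with the same five levels even when some regions are absent.

```r
# Region definitions copied from the question, kept in one named list.
regions <- list(
  northeast = c("CT","ME","MA","NH","RI","VT","NJ","NY","PA"),
  midwest   = c("IL","IN","MI","OH","WI","IA","KS","MN","MO","NE","ND","SD"),
  south     = c("DE","FL","GA","MD","NC","SC","VA","DC","WV","AL","KY","MS","TN",
                "AR","LA","OK","TX"),
  west      = c("AZ","CO","ID","MT","NV","NM","UT","WY","AK","CA","HI","OR","WA"),
  other     = c("AA","AE","AP","AS","FM","GU","MH","None","PR",
                "UNITED STATES MINOR OUTLYING ISLANDS","VI")
)

# Named character vector: names are state codes, values are region names.
lookup <- setNames(rep(names(regions), lengths(regions)),
                   unlist(regions, use.names = FALSE))

# Hypothetical helper: map state codes to a factor with a fixed level set.
# Codes that appear in no region become NA instead of raising a warning.
to_region <- function(state) {
  factor(unname(lookup[as.character(state)]), levels = names(regions))
}

# data.frame is used here only to keep the sketch free of package dependencies;
# the same assignment works on a tibble.
dat_2 <- data.frame(state = c('ID', 'IL', 'IN', 'KS', 'KY',
                              'LA', 'MA', 'MD', 'ME', 'MH', 'MI', 'MN', 'MO',
                              'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
                              'None', 'NV', 'NY', 'OH', 'OK'))
dat_2$census_region <- to_region(dat_2$state)

head(dat_2)
levels(dat_2$census_region)
```

Because the levels come from names(regions) rather than from whatever happens to be in the data, a model fitted on dat should see the same factor coding when it is applied to dat_2.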