rmultiple-columnsdemographics

Combining Multiple Columns to Create a Single Variable


A dataset I am using recorded respondents' ethnicity. Responses are recorded across multiple variables, and respondents were allowed to pick more than one. Example:

Black White Asian Hispanic 

1     NA    NA    NA

NA    1     NA    NA

NA    NA    NA    1

NA    NA    1     1

^^^In the last row, the respondent would have chosen Asian and Hispanic.

What I want to do is:

A) collapse these columns into a single ethnicity variable, with different numbers representing different ethnicities (i.e., black would be 1, white would be 2 etc.)

B) have it so that anyone who reported multiple columns gets designated "multiple".

I'm a bit of an R novice, so any help would be greatly appreciated!


Solution

  • One way to do this is to pivot your column names as a column, group values by respondent, then drop the NA values. Then just choose the ethnicity value that remains for each group, switching to "multiple" when necessary. Here's a way to do that with tidyverse:

    library(tidyverse)
    
    df %>% 
      rownames_to_column("respondent") %>% 
      pivot_longer(-respondent) %>% 
      group_by(respondent) %>% 
      filter(!is.na(value)) %>% 
      summarise(eth = ifelse(n() == 1, name, "multiple"))
    
    # A tibble: 4 x 2
      respondent eth     
      <chr>      <chr>   
    1 1          Black   
    2 2          White   
    3 3          Hispanic
    4 4          multiple
    

    You won't be able to store numbers, as numeric types, with a string like "variable" - so you have a choice. Either stick with the ethnicity labels (like the solution above), or convert labels to numbers and then numbers to the string representations of those numbers. That seems a little unwieldy, but if you want to do that, here's how:

    df %>% 
      rownames_to_column("respondent") %>% 
      pivot_longer(-respondent) %>% 
      mutate(eth_num = as.character(as.numeric(fct_inorder(name)))) %>% 
      group_by(respondent) %>% 
      filter(!is.na(value)) %>% 
      summarise(eth = ifelse(n() == 1, eth_num, "multiple"))
    
    # A tibble: 4 x 2
      respondent eth     
      <chr>      <chr>   
    1 1          1       
    2 2          2       
    3 3          4       
    4 4          multiple