
How to recode values in haven_labelled vectors in R


I am working with data imported from SPSS using read_sav() from the haven package.

The columns are of class haven_labelled, which resembles a factor in that each value carries a label, but differs in other ways.

I want to recode the values in the data and associated label values.

Here is an example:

library(haven)
library(dplyr)
library(labelled)
library(tidyr)

x <- structure(list(q0015_0001 = structure(c(3, 5, NA, 3, 1, 2, NA, NA, 3, 4, 2, NA, 2, 2, 4, NA,
 4, 3, 3, 3, 3, 2, NA, NA, 2), label = "Menu Options/Variety", format.spss = "F8.2", labels = 
c(`Very Dissatisfied` = 1, Dissatisfied = 2, Neutral = 3, Satisfied = 4, `Very Satisfied` = 5), 
class = c("haven_labelled", "vctrs_vctr", "double")), q0015_0002 = structure(c(4, 4, NA, 5, 3, 3, 
NA, NA, 3, 4, 2, NA, 5, 2, 4, NA, 4, 3, 4, 4, 4, 4, NA, NA, 2), label = "Cleanliness", 
format.spss = "F8.2", labels = c(`Very Dissatisfied` = 1, Dissatisfied = 2, Neutral = 3, 
Satisfied = 4, `Very Satisfied` = 5), class = c("haven_labelled", "vctrs_vctr", "double")), 
q0015_0003 = 
structure(c(2, 2, NA, 3, 1, 2, NA, NA, 3, 4, 3, NA, 4, 3, 4, NA, 3, 2, 4, 4, 2, 2, NA, NA, 1),
 label = "Taste and Quality of Food", format.spss = "F8.2", labels = c(`Very Dissatisfied` = 1, 
Dissatisfied = 2, Neutral = 3, Satisfied = 4, `Very Satisfied` = 5), class = c("haven_labelled", 
"vctrs_vctr", "double"))), row.names = c(NA, -25L), class = c("tbl_df", "tbl", "data.frame"), 
label = "File created by user")

x

# A tibble: 25 x 3
#               q0015_0001          q0015_0002             q0015_0003
#                <dbl+lbl>           <dbl+lbl>              <dbl+lbl>
# 1  3 [Neutral]            4 [Satisfied]       2 [Dissatisfied]     
# 2  5 [Very Satisfied]     4 [Satisfied]       2 [Dissatisfied]     
# 3 NA                     NA                  NA                    
# 4  3 [Neutral]            5 [Very Satisfied]  3 [Neutral]          
# 5  1 [Very Dissatisfied]  3 [Neutral]         1 [Very Dissatisfied]
# 6  2 [Dissatisfied]       3 [Neutral]         2 [Dissatisfied]     
# 7 NA                     NA                  NA                    
# 8 NA                     NA                  NA                    
# 9  3 [Neutral]            3 [Neutral]         3 [Neutral]          
#10  4 [Satisfied]          4 [Satisfied]       4 [Satisfied]        
# ... with 15 more rows

To illustrate the column structure better:

x$q0015_0001

#<labelled<double>[25]>: Menu Options/Variety
# [1]  3  5 NA  3  1  2 NA NA  3  4  2 NA  2  2  4 NA  4  3  3  3  3  2 NA NA  2
#
#Labels:
# value             label
#     1 Very Dissatisfied
#     2      Dissatisfied
#     3           Neutral
#     4         Satisfied
#     5    Very Satisfied

The data include values from 1 to 5, each with a corresponding label (i.e., 1 = "Very Dissatisfied", etc.). haven_labelled allows numeric or character values.
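As a small aside, here is a minimal sketch of both cases, built by hand with haven::labelled(). The vectors num and chr are made up for illustration and are not taken from the SPSS file.

```r
library(haven)

# Numeric values with labels (only some values need a label)
num <- labelled(
  c(1, 5, 3),
  labels = c("Very Dissatisfied" = 1, "Very Satisfied" = 5)
)

# Character values with labels
chr <- labelled(
  c("M", "F", "M"),
  labels = c(Male = "M", Female = "F")
)

typeof(num)          # underlying storage type of the numeric vector
typeof(chr)          # underlying storage type of the character vector
attr(chr, "labels")  # the value labels live in the "labels" attribute
```

In both cases the labels are stored as a named vector of the same type as the data, which is why recoding the data alone leaves them untouched.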

I wish to change the values from c(1, 2, 3, 4, 5) to c(-2, -1, 0, 1, 2) but preserve the labels in the same order (i.e., -2 = "Very Dissatisfied", etc.).

Label              Old Value  New Value
Very Dissatisfied          1         -2
Dissatisfied               2         -1
Neutral                    3          0
Satisfied                  4          1
Very Satisfied             5          2

The closest I have come is using dplyr::recode(). The labelled package is supposed to extend the dplyr::recode() method to work with labelled vectors [1], but I haven't noticed a difference with/without it being loaded.

dplyr::recode(x$q0015_0001,`1` = -2, `2` = -1, `3` = 0, `4` = 1, `5` = 2)

#<labelled<double>[25]>: Menu Options/Variety
# [1]  0  2 NA  0 -2 -1 NA NA  0  1 -1 NA -1 -1  1 NA  1  0  0  0  0 -1 NA NA -1
#
#Labels:
# value             label
#     1 Very Dissatisfied
#     2      Dissatisfied
#     3           Neutral
#     4         Satisfied
#     5    Very Satisfied

Notice that the data values changed as expected (3 became 0, 5 became 2, etc.), but the label values did not. As a result, converting the vector with as_factor() (haven's counterpart to base R's as.factor() for labelled vectors) returns the wrong labels. The effect is clearer when the values and labels are viewed together:

x %>% 
  mutate(across(starts_with("q0015"), 
  ~recode(., `1` = -2, `2` = -1, `3` = 0, `4` = 1, `5` = 2)))

# A tibble: 25 x 3
#q0015_0001             q0015_0002             q0015_0003
#<dbl+lbl>              <dbl+lbl>              <dbl+lbl>
#1  0                      1 [Very Dissatisfied] -1                    
#2  2 [Dissatisfied]       1 [Very Dissatisfied] -1                    
#3 NA                     NA                     NA                    
#4  0                      2 [Dissatisfied]       0                    
#5 -2                      0                     -2                    
#6 -1                      0                     -1                    
#7 NA                     NA                     NA                    
#8 NA                     NA                     NA                    
#9  0                      0                      0                    
#10  1 [Very Dissatisfied]  1 [Very Dissatisfied]  1 [Very Dissatisfied]
# ... with 15 more rows

As shown, the labels still map to the old values. In the recoded version, 1 and 2 are positive scores but still map to Very Dissatisfied/Dissatisfied, while -2, -1 and 0 are not recognized as labelled values.
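The mismatch can be reproduced in isolation. The sketch below uses a hand-built vector, stale (a name chosen here for illustration), whose values are already shifted to the -2..2 scale while the labels still point at 1..5. It assumes as_factor() is called with its default levels argument.

```r
library(haven)

# Values already recoded, labels still attached to the old 1..5 codes
stale <- labelled(
  c(0, 2, -2, 1),
  labels = c("Very Dissatisfied" = 1, "Dissatisfied" = 2, "Neutral" = 3,
             "Satisfied" = 4, "Very Satisfied" = 5)
)

res <- as.character(as_factor(stale))
res
# The value 2 comes back as "Dissatisfied" and 1 as "Very Dissatisfied",
# because the labels are looked up by the old codes.
# 0 and -2 match no label at all, so they do not get one.
```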

Question: How can I recode labelled vectors so that the data values and label values are updated together, with each label preserved and mapped to its new value?


Solution

  • It's ugly AF, but it does the job. The problem is that setting value labels is not straightforward. The labelled package offers functions for it, but these aren't "tidyverse-ready", i.e. they don't work within mutate(), nor do they let you select variables with tidyselect helpers like starts_with().

    However, set_value_labels() allows passing a list in which each element is named after the variable you want to label, and the labels themselves are given as a named vector:

    x |>  
      mutate(across(starts_with("q0015"), 
                    ~dplyr::recode(., `1` = -2, `2` = -1, `3` = 0, `4` = 1, `5` = 2))) |> 
      set_value_labels(.labels = rep(list(c("Very Dissatisfied" = -2,
                                            "Dissatisfied" = -1,
                                            "Neutral" = 0,
                                            "Satisfied" = 1,
                                            "Very Satisfied" = 2)),
                                     x |> 
                                       select(starts_with("q0015")) |> 
                                       ncol()) |> 
                         setNames(nm = x |> 
                                    select(starts_with("q0015")) |> 
                                    names()))
    

    which gives:

    # A tibble: 25 × 3
       q0015_0001             q0015_0002          q0015_0003            
       <dbl+lbl>              <dbl+lbl>           <dbl+lbl>             
     1  0 [Neutral]            1 [Satisfied]      -1 [Dissatisfied]     
     2  2 [Very Satisfied]     1 [Satisfied]      -1 [Dissatisfied]     
     3 NA                     NA                  NA                    
     4  0 [Neutral]            2 [Very Satisfied]  0 [Neutral]          
     5 -2 [Very Dissatisfied]  0 [Neutral]        -2 [Very Dissatisfied]
     6 -1 [Dissatisfied]       0 [Neutral]        -1 [Dissatisfied]     
     7 NA                     NA                  NA                    
     8 NA                     NA                  NA                    
     9  0 [Neutral]            0 [Neutral]         0 [Neutral]          
    10  1 [Satisfied]          1 [Satisfied]       1 [Satisfied]        
    # … with 15 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    I was curious and checked with the developer of the labelled package. An alternative is to write a small function that recodes and relabels a single variable, and then run that function within across():

    https://github.com/larmarange/labelled/issues/126
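A rough sketch of what such a helper could look like follows. This is not the code from the linked issue: recode_with_labels() is a hypothetical name, and the small tibble df is made up so the example runs on its own. It applies the same old-to-new mapping to the data and to the "labels" attribute, then rebuilds the vector with haven::labelled().

```r
library(haven)
library(dplyr)

# Hypothetical helper (not part of haven or labelled): apply one
# old -> new mapping to both the data values and the value labels.
recode_with_labels <- function(v, old, new) {
  stopifnot(length(old) == length(new))

  # Recode the underlying data; values not listed in `old` stay as they are
  vals <- as.vector(unclass(v))
  idx <- match(vals, old)
  vals[!is.na(idx)] <- new[idx[!is.na(idx)]]

  # Recode the label values the same way; the label names are kept
  labs <- attr(v, "labels")
  lidx <- match(labs, old)
  labs[!is.na(lidx)] <- new[lidx[!is.na(lidx)]]

  # Rebuild the labelled vector, keeping the variable label.
  # Other attributes such as format.spss are not carried over here.
  labelled(vals, labels = labs, label = attr(v, "label"))
}

likert <- c("Very Dissatisfied" = 1, "Dissatisfied" = 2, "Neutral" = 3,
            "Satisfied" = 4, "Very Satisfied" = 5)

# Made-up data with the same shape as the question's columns
df <- tibble(
  q0015_0001 = labelled(c(3, 5, NA, 1), labels = likert,
                        label = "Menu Options/Variety"),
  q0015_0002 = labelled(c(4, 4, NA, 3), labels = likert,
                        label = "Cleanliness")
)

out <- df |>
  mutate(across(starts_with("q0015"),
                \(v) recode_with_labels(v, old = 1:5, new = -2:2)))

out
```

Because the helper works on a single vector, it can be used inside mutate(across(...)) with any tidyselect helper, which avoids building the named list of labels by hand.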