Tags: r, label, spss, r-haven

R - changing values to labels permanently in labelled data


I have been using haven and sjlabelled to work with the labels stored in SPSS .sav files.

Here is some example data (the real data is much larger with many more variables, values, labels, etc., and all values occur numerous times):

library(sjlabelled)
col1 <- c("a", "b", "c")
col2 <- c(1, 2, 3)
df <- data.frame(col1, col2)
labels <- c("x", "y", "z")
df <- set_labels(df, col2, labels = labels)

I know I can use as_label to work with the data frame by label (subsetting on labels, etc.). However, I want to replace the values with the labels permanently, because some functions and processes revert the data to its underlying values and drop the labels entirely. I haven't been able to pin down when this happens.

Using the example data, I want the original data frame to end up like the following, overwriting the values with their labels in place rather than defining a new data frame:

col1 <- c("a", "b", "c")
col2 <- c("x", "y", "z") # these were the labels but are now the values
df <- data.frame(col1, col2)

Solution

  • The get_labels(x)[x] approach indexes the labels by position, so it can give wrong results when not every labelled value occurs in the dataset, or when all values are missing (which can happen in survey data).

    sjlabelled::read_spss by default converts all atomic variables with value labels to factors. Since these represent labelled categorical variables, it makes sense to return them as factors here too. Atomic variables without value labels are assumed to be continuous and are returned as is.

    sjlabelled::copy_labels can be used to restore value and variable labels after they have been dropped.

    library(sjlabelled)
    
    # Create test data
    df <- data.frame(
      col1 = c("a", "b", "c"),
      col2 = c(1, 2, 3),
      col3 = c(NA, NA, NA)
    )
    
    df <- set_labels(df, col2, col3, labels = c("0" = "w", "1" = "x", "2" = "y", "3" = "z")) |>
      var_labels(
        col1 = "Var 1",
        col2 = "Var 2",
        col3 = "var 3"
      )
    
    
    # Convert a labelled variable to a regular R factor.
    # Variables without a "labels" attribute are returned unchanged.
    labels_to_values <- function(x, ...) {
      
      labs <- attr(x, "labels")
      
      if (!is.null(labs)) {
        # Match on the stored values (levels), not on position
        x <- factor(x, levels = labs, labels = names(labs))
      }
      
      return(x)
      
    }
    
    # Positional indexing: wrong labels for col2 (should be x, y, z)
    # and the wrong length for col3 (4 results for 3 rows)
    lapply(df[, 2:3], \(x) get_labels(x)[x])
    #> $col2
    #> [1] "w" "x" "y"
    #> 
    #> $col3
    #> [1] NA NA NA NA
    
    # This approach returns expected results
    df <- lapply(df, labels_to_values) |>
      data.frame() |>
      copy_labels(df)  
    
    df
    #>   col1 col2 col3
    #> 1    a    x <NA>
    #> 2    b    y <NA>
    #> 3    c    z <NA>
    
    str(df)
    #> 'data.frame':    3 obs. of  3 variables:
    #>  $ col1: chr  "a" "b" "c"
    #>   ..- attr(*, "label")= chr "Var 1"
    #>  $ col2: Factor w/ 4 levels "w","x","y","z": 2 3 4
    #>   ..- attr(*, "label")= chr "Var 2"
    #>   ..- attr(*, "labels")= Named num [1:4] 0 1 2 3
    #>   .. ..- attr(*, "names")= chr [1:4] "w" "x" "y" "z"
    #>  $ col3: Factor w/ 4 levels "w","x","y","z": NA NA NA
    #>   ..- attr(*, "label")= chr "var 3"
    #>   ..- attr(*, "labels")= Named num [1:4] 0 1 2 3
    #>   .. ..- attr(*, "names")= chr [1:4] "w" "x" "y" "z"
    

    Created on 2023-10-31 with reprex v2.0.2
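
To make the failure mode concrete, here is a minimal base-R sketch (no packages needed) of why indexing the labels by position goes wrong and why matching on the stored values does not. The vector `x`, the `labs` lookup, and the hand-built "labels" attribute are assumptions for illustration; they mimic the structure shown in the str() output above but are not created by sjlabelled here. The last step converts to character, in case plain strings (as asked in the question) are preferred over a factor.

```r
# Hypothetical labelled vector: values 1-3, with labels defined for 0-3.
x <- c(1, 2, 3)
labs <- c(w = 0, x = 1, y = 2, z = 3)  # names are the labels, elements the values
attr(x, "labels") <- labs

# Positional lookup: value 1 picks the FIRST label ("w"), not the label for 1.
by_position <- names(labs)[x]

# Value-based lookup: match each value against the labelled values.
by_value <- names(labs)[match(x, labs)]

# Same mapping via factor(), then as.character() if plain strings are wanted.
as_chr <- as.character(factor(x, levels = labs, labels = names(labs)))

print(by_position)  # "w" "x" "y"
print(by_value)     # "x" "y" "z"
print(as_chr)       # "x" "y" "z"

# All-missing column: a logical NA index is recycled to the length of the
# label vector, so the positional result no longer has one entry per row.
all_missing <- c(NA, NA, NA)
n_position <- length(names(labs)[all_missing])               # 4
n_value    <- length(names(labs)[match(all_missing, labs)])  # 3
```

The value-based versions return one result per observation whatever the labelled range is, which is why the factor(levels = , labels = ) form is the safer choice for survey data.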
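
As for when labels get dropped: one common case, sketched below under stated assumptions, is ordinary row subsetting of a column that is a plain numeric vector carrying a "labels" attribute. Base `[` keeps the values but not custom attributes. The data frame and attribute here are built by hand for illustration, and the optional copy_labels call assumes the same argument order as in the reprex above (new data first, original second); it only runs if sjlabelled is installed.

```r
# Hypothetical data: a plain numeric column with a hand-set "labels" attribute.
df <- data.frame(col2 = c(1, 2, 3))
attr(df$col2, "labels") <- c(w = 0, x = 1, y = 2, z = 3)

# Row subsetting goes through base `[`, which drops custom attributes.
df_sub <- df[df$col2 > 1, , drop = FALSE]
dropped <- is.null(attr(df_sub$col2, "labels"))
print(dropped)  # TRUE

# The original data frame still carries its labels.
kept <- !is.null(attr(df$col2, "labels"))
print(kept)     # TRUE

# Optional: restore the attributes from the original data frame.
if (requireNamespace("sjlabelled", quietly = TRUE)) {
  df_restored <- sjlabelled::copy_labels(df_sub, df)
}
```

Converting labelled columns to factors (or characters) before such steps avoids depending on attributes that may not survive them.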