
How do I shorten super long variable labels of an SPSS file in R?


Context:

Problem: I would like to be able to remove the text from all these 200 variable labels that is the same, and be left with only the unique words at the end.

Where I'm at:

I'm able to locate all 200 variable names of interest here because they all contain "RANK", with the below:

colnames(
  look_for_and_select(data,
                      "RANK",
                      labels = FALSE,
                      values = FALSE,
                      ignore.case = FALSE)
)

Everything I'm finding in my searches is a function to replace a variable label or add a new one, but I need to retain the unique text already in the label. I think there may be a way to use gsub(), but I would appreciate any ideas or help!

Example: Here is the text of one variable label. The final words, which I'd like to retain, are "plant native species". Every variable label has this same huge block of text at the beginning, and a different 2-5 words at the end:

"1.5 Which actions present the greatest opportunity to advance this scenario's goal? Select up to 3 of your previously selected actions, drag them into the box, and rank them in order of their importance for this goal (1=most important). Scenario Context & Assumptions Imagine yourself as a manager responsible for planning all street tree management on a hypothetical street in your region for the next 20 years. The street has a sidewalk and low-rise development, with buildings that are approximately 1-3 stories tall. In this mostly perfect world, you have full control over the trees adjacent to the sidewalk. Within these scenarios, assume the following will be true: Each action you select is in comparison to not doing the actions at all Actions will be performed according to industry standard" Trees will not be removed by events such as vandalism, new construction, extreme storms, or pest invasions Newly planted trees will be watered through establishment and as needed for survival Urban form will not change (street, pavement, buildings will remain the same with reasonable maintenance) Species planted will be adapted for any change in hardiness zone projected in 20 years due to climate change - Ranks - Opportunities to achieve this scenario's goal (rank 3) - plant native species"


Solution

    library(tidyverse)
    library(labelled)
    

    Create a data.frame with variable labels that resemble your description.

    same <-
      '1.5 Which actions present the greatest opportunity to advance this scenario\'s goal? Select up to 3 of your previously selected actions, drag them into the box, and rank them in order of their importance for this goal (1=most important). Scenario Context & Assumptions Imagine yourself as a manager responsible for planning all street tree management on a hypothetical street in your region for the next 20 years. The street has a sidewalk and low-rise development, with buildings that are approximately 1-3 stories tall. In this mostly perfect world, you have full control over the trees adjacent to the sidewalk. Within these scenarios, assume the following will be true: Each action you select is in comparison to not doing the actions at all Actions will be performed according to industry standard" Trees will not be removed by events such as vandalism, new construction, extreme storms, or pest invasions Newly planted trees will be watered through establishment and as needed for survival Urban form will not change (street, pavement, buildings will remain the same with reasonable maintenance) Species planted will be adapted for any change in hardiness zone projected in 20 years due to climate change - Ranks - Opportunities to achieve this scenario\'s goal (rank 3) - '
    
    var_labels <- paste(same, "plant native species", 1:10)
    
    var_names <- paste("var", 1:10, sep = "_")
    
    
    df <-
      var_names |>
      lapply(\(x) tibble(a = 1:10) |> set_names(x)) |>
      bind_cols() |>
      set_variable_labels(.labels = var_labels)
    

    We can use labelled::get_variable_labels() to obtain all variable labels and then shorten them with stringr::str_remove() and a regular expression (regex), as suggested by user20650 in their comment. Because .* is greedy, the pattern .*- matches everything from the start of the label up to and including the last hyphen, and str_remove() deletes that match; trimws() then strips the leftover whitespace. As long as every label separates its final part with a hyphen, and that final part contains no hyphen itself, this works across all labels.

    short_var_labels <-
      df |>
      get_variable_labels() |>
      str_remove(".*-") |>
      trimws()
    
    short_var_labels
    #>  [1] "plant native species 1"  "plant native species 2" 
    #>  [3] "plant native species 3"  "plant native species 4" 
    #>  [5] "plant native species 5"  "plant native species 6" 
    #>  [7] "plant native species 7"  "plant native species 8" 
    #>  [9] "plant native species 9"  "plant native species 10"
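
    One caveat with the greedy .*- pattern: if the unique tail itself contains a hyphen, the cut happens inside the tail. The sketch below uses made-up labels (the prefix and the "drought-tolerant" tail are hypothetical, not from the question's data) and base R's sub(), which the question already had in mind via gsub(); the same patterns can be passed to str_remove(). It assumes the separator is always a hyphen with a space on each side.

```r
# Hypothetical labels: a shared prefix ending in " - ", then a unique tail.
prefix <- "Shared question text about low-rise streets - Ranks - Opportunities (rank 3) - "
labels <- paste0(prefix, c("plant native species", "plant drought-tolerant species"))

# Greedy ".*-" removes everything up to the LAST hyphen,
# so a hyphen inside the tail truncates the second label.
short_greedy <- trimws(sub(".*-", "", labels))

# Requiring a space on each side of the hyphen leaves hyphenated words intact.
short_spaced <- sub(".* - ", "", labels)

# If the shared text is known exactly, it can be removed as a literal string;
# fixed = TRUE means characters such as "(" are not treated as regex syntax.
short_fixed <- sub(prefix, "", labels, fixed = TRUE)

print(short_greedy)
print(short_spaced)
print(short_fixed)
```

    Removing the literal prefix is the strictest option: a label that does not contain the shared text is left unchanged, which makes unexpected labels easy to spot.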
    

    Now we can apply the short variable labels to the data.frame.

    df <-
      df |>
      set_variable_labels(.labels = short_var_labels)
    
    get_variable_labels(df)
    #> $var_1
    #> [1] "plant native species 1"
    #> 
    #> $var_2
    #> [1] "plant native species 2"
    #> 
    #> $var_3
    #> [1] "plant native species 3"
    #> 
    #> $var_4
    #> [1] "plant native species 4"
    #> 
    #> $var_5
    #> [1] "plant native species 5"
    #> 
    #> $var_6
    #> [1] "plant native species 6"
    #> 
    #> $var_7
    #> [1] "plant native species 7"
    #> 
    #> $var_8
    #> [1] "plant native species 8"
    #> 
    #> $var_9
    #> [1] "plant native species 9"
    #> 
    #> $var_10
    #> [1] "plant native species 10"
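
    The example above relabels every column. In the question only the variables whose names contain "RANK" should change, so here is a minimal sketch of restricting the update to those columns. The column names and labels are invented for illustration, and grep() on the column names stands in for the look_for_and_select() call from the question. It relies on var_label<- accepting a named list and updating only the variables named in it.

```r
library(labelled)

# Hypothetical data: two RANK columns with long labels and one unrelated column.
df <- data.frame(Q1_RANK_1 = 1:3, Q1_RANK_2 = 3:1, AGE = c(30, 41, 52))
var_label(df) <- list(
  Q1_RANK_1 = "Long shared question text - Ranks - plant native species",
  Q1_RANK_2 = "Long shared question text - Ranks - prune regularly",
  AGE       = "Age - in years"
)

# Names of the columns whose labels should be shortened.
rank_cols <- grep("RANK", names(df), value = TRUE)

# Current labels of those columns as a named character vector.
old_labels <- unlist(var_label(df[rank_cols]))

# Drop everything up to the last " - " and rebuild a named list of new labels.
new_labels <- setNames(
  as.list(trimws(sub(".* - ", "", old_labels))),
  names(old_labels)
)

# A named list only touches the variables it names; AGE keeps its label.
var_label(df) <- new_labels

print(var_label(df))
```

    With real data, rank_cols could instead come from the look_for_and_select() call shown in the question.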