
Run a function across columns without stripping Labelled attributes


I have a large number of SPSS files, and I need to generate random data that looks like the data that is in the file.

When I read in the SPSS files (using haven's read_sav), they come in with variable and value labels (from labelled), and I would like each variable to have those same attributes when I write the SPSS file after scrambling the data. However, when I scramble the sequence of each column independently, sapply strips the Labelled attributes (because it's returning a matrix that I'm coercing into a data.frame).

How can I do this without stripping those attributes? See example below:

library(labelled) # provides var_label() and val_labels()

dat<-data.frame(a=c(1,2,3,4,5,6,7,8,9,10),b=c("a","b","c","d","e","f","g","h","i","j"))

var_label(dat$a)<-"The first variable"
val_labels(dat$a)<-c(first=1,
                     second=2,
                     third=3,
                     fourth=4,
                     fifth=5,
                     sixth=6,
                     seventh=7,
                     eighth=8,
                     ninth=9,
                     tenth=10)

var_label(dat$b)<-"The second variable"

# Variable has variable and value labels
dat$a

faker<-function(thing){
  thing<-sample(thing,length(thing),replace=TRUE)
  thing
}

rat<-as.data.frame(sapply(dat,faker))

# Variable no longer has variable and value labels
rat$a

Edit: corrected a typo on the last line of the code, which was dat$a and should have been rat$a.


Solution

  • (Up front: I'm assuming the result was meant to be assigned back to dat rather than to rat. Otherwise the problem is not reproducible, and shouldn't be, since dat itself is not changed when rat is created.)

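    sapply simplifies its result to a matrix, which forces every column to a single common type (here, character) and drops the labelled attributes. Switch to lapply, which returns a list and leaves each column's class and attributes intact.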

    # ...
    dat<-as.data.frame(sapply(dat,faker))
    dat$a
    #  [1] "1"  "7"  "5"  "7"  "10" "2"  "1"  "4"  "9"  "4" 
    

    The fix:

    dat <- as.data.frame(lapply(dat,faker))
    dat$a
    # <labelled<double>[10]>: The first variable
    #  [1]  2  3  8  9  8  5  5 10  1  3
    # Labels:
    #  value   label
    #      1   first
    #      2  second
    #      3   third
    #      4  fourth
    #      5   fifth
    #      6   sixth
    #      7 seventh
    #      8  eighth
    #      9   ninth
    #     10   tenth
    

    Side note: when applying a function to every column of a frame, I tend to use dat[] <- lapply(...) instead of dat <- as.data.frame(lapply(...)), because it preserves the frame's own attributes (class, row names, and so on) and replaces only the columns' contents.

    # ...
    dat[] <- lapply(dat, faker)
    dat$a
    # <labelled<double>[10]>: The first variable
    #  [1] 4 8 7 7 9 7 9 1 3 5
    # Labels:
    #  value   label
    #      1   first
    #      2  second
    #      3   third
    #      4  fourth
    #      5   fifth
    #      6   sixth
    #      7 seventh
    #      8  eighth
    #      9   ninth
    #     10   tenth
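
As a rough check of the points above, here is a minimal sketch that compares the two approaches and then writes the scrambled frame out and reads it back. It assumes the labelled and haven packages are installed; the shortened three-row data, the names m, scrambled, tmp, and back, and the temporary file are made up for illustration and are not part of the original question.

```r
library(labelled)  # var_label(), val_labels()
library(haven)     # write_sav(), read_sav()

# Shortened, hypothetical version of the example data
dat <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"))
var_label(dat$a) <- "The first variable"
val_labels(dat$a) <- c(first = 1, second = 2, third = 3)
var_label(dat$b) <- "The second variable"

# Resample each column with replacement
faker <- function(thing) sample(thing, length(thing), replace = TRUE)

# sapply() simplifies to a matrix, so both columns end up as character
m <- sapply(dat, faker)
is.matrix(m)  # TRUE
typeof(m)     # "character"

# lapply() returns a list; dat[] <- keeps the frame and swaps the columns
scrambled <- dat
scrambled[] <- lapply(dat, faker)
var_label(scrambled$a)
val_labels(scrambled$a)

# Round trip through a temporary .sav file to see whether the labels
# are still there after writing and re-reading
tmp <- tempfile(fileext = ".sav")
write_sav(scrambled, tmp)
back <- read_sav(tmp)
var_label(back$a)
val_labels(back$a)
```

If the labels print for both scrambled$a and back$a, the attributes have survived the scrambling and the write. For real files, tmp would be replaced by the actual input and output paths.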