I have a large number of SPSS files, and I need to generate random data that looks like the data that is in the file.
When I read in the SPSS files (using haven's read_sav), they come in with variable and value labels (from labelled), and I would like each variable to have those same attributes when I write the SPSS file after scrambling the data. However, when I scramble the sequence of each column independently, sapply strips the labelled attributes (because it's returning a matrix that I'm coercing into a data.frame).
How can I do this without stripping those attributes? See example below:
library(labelled)  # provides var_label<- and val_labels<-

dat <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  b = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
var_label(dat$a) <- "The first variable"
val_labels(dat$a) <- c(first = 1, second = 2, third = 3, fourth = 4, fifth = 5,
                       sixth = 6, seventh = 7, eighth = 8, ninth = 9, tenth = 10)
var_label(dat$b) <- "The second variable"
# Variable has variable and value labels
dat$a
# Resample a vector with replacement, keeping its length
faker <- function(thing) {
  sample(thing, length(thing), replace = TRUE)
}
rat <- as.data.frame(sapply(dat, faker))
# Variable no longer has variable and value labels
rat$a
Edited to correct a typo on the last line of the code, which was dat$a and should have been rat$a.
(Up front, I'm assuming your assignment to rat should really be to dat; otherwise the problem is not reproducible, since dat is not changed when forming rat.)
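Your use of sapply is homogenizing and dumbing down the data: it simplifies the per-column results into a single matrix, which can hold only one type and drops the columns' attributes. Switch to lapply, which returns a list with one element per column.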
# ...
dat<-as.data.frame(sapply(dat,faker))
dat$a
# [1] "1" "7" "5" "7" "10" "2" "1" "4" "9" "4"
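To see why the labels vanish, it can help to look at what sapply actually returns. The following is a minimal base-R sketch; the data frame df and the helper shuffle are illustrative names standing in for dat and faker, not part of the question.

```r
# Illustrative frame with one numeric and one character column
df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Hypothetical stand-in for faker(): resample with replacement
shuffle <- function(v) sample(v, length(v), replace = TRUE)

m <- sapply(df, shuffle)  # simplified into a single matrix
l <- lapply(df, shuffle)  # stays a list, one element per column

is.matrix(m)  # TRUE
typeof(m)     # "character": the numeric column was coerced
is.list(l)    # TRUE
typeof(l$a)   # "double": the numeric column keeps its type
```

Because a matrix has a single type and no per-column attributes, anything attached to the individual columns cannot survive the round trip through sapply.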
The fix:
dat <- as.data.frame(lapply(dat,faker))
dat$a
# <labelled<double>[10]>: The first variable
# [1] 2 3 8 9 8 5 5 10 1 3
# Labels:
# value label
# 1 first
# 2 second
# 3 third
# 4 fourth
# 5 fifth
# 6 sixth
# 7 seventh
# 8 eighth
# 9 ninth
# 10 tenth
Side note: when applying something to all columns of a frame, instead of dat <- as.data.frame(lapply(...)), I tend to use dat[] <- lapply(...), as it preserves the frame's attributes and replaces/augments the columns' contents.
# ...
dat[] <- lapply(dat, faker)
dat$a
# <labelled<double>[10]>: The first variable
# [1] 4 8 7 7 9 7 9 1 3 5
# Labels:
# value label
# 1 first
# 2 second
# 3 third
# 4 fourth
# 5 fifth
# 6 sixth
# 7 seventh
# 8 eighth
# 9 ninth
# 10 tenth
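The difference described in the side note shows up in attributes that belong to the frame itself rather than to its columns. Here is a small base-R sketch, assuming a made-up frame-level attribute called "source" and custom row names; df, shuffle, rebuilt and inplace are all illustrative names.

```r
df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"),
                 row.names = c("r1", "r2", "r3"),
                 stringsAsFactors = FALSE)
attr(df, "source") <- "survey.sav"  # hypothetical frame-level attribute

# Hypothetical stand-in for faker(): resample with replacement
shuffle <- function(v) sample(v, length(v), replace = TRUE)

rebuilt <- as.data.frame(lapply(df, shuffle))  # builds a brand-new frame
inplace <- df
inplace[] <- lapply(df, shuffle)               # refills the existing frame

attr(rebuilt, "source")  # NULL: the custom attribute is gone
rownames(rebuilt)        # "1" "2" "3": row names were reset
attr(inplace, "source")  # "survey.sav"
rownames(inplace)        # "r1" "r2" "r3"
```

Rebuilding with as.data.frame starts from a bare list, so only what the columns carry comes along; assigning into dat[] keeps the original frame as the container.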