I have two data frames, one is survey data (data.csv) and another one is label data (label.csv). Here is the sample data (My original data has about 150 variables)
#sample data
df <- tibble::tribble(
~id, ~House_member, ~dob, ~age_quota, ~work, ~sex, ~pss,
1L, 4L, 1983L, 2L, 2L, 1, 1,
2L, 1L, 1940L, 7L, 2L, 1, 2,
3L, 2L, 1951L, 5L, 6L, 1, 1,
4L, 4L, 1965L, 2L, 2L, 1, 4,
5L, 3L, 1965L, 2L, 3L, 1, 1,
6L, 1L, 1951L, 3L, 1L, 1, 3,
7L, 1L, 1955L, 1L, 1L, 1, 3,
8L, 4L, 1982L, 2L, 2L, 2, 5,
9L, 2L, 1990L, 2L, 4L, 2, 3,
10L, 2L, 1953L, 3L, 2L, 2, 4
)
#sample label data
label <- tibble::tribble(
~variable, ~value, ~label,
"House_member", NA, "How many people live with you?",
"House_member", 1L, "1 person",
"House_member", 2L, "2 persons",
"House_member", 3L, "3 persons",
"House_member", 4L, "4 persons",
"House_member", 5L, "5 persons",
"House_member", 6L, "6 persons",
"House_member", 7L, "7 persons",
"House_member", 8L, "8 persons",
"House_member", 9L, "9 persons",
"House_member", 10L, "10 or more",
"dob", NA, "date of brith",
"age_quota", NA, "age_quota",
"age_quota", 1L, "10-14",
"age_quota", 2L, "15-19",
"age_quota", 3L, "20-29",
"age_quota", 4L, "30-39",
"age_quota", 5L, "40-49",
"age_quota", 6L, "50-70",
"age_quota", 7L, "70 +",
"work", NA, "what is your occupation?",
"work", 1L, "full time",
"work", 2L, "part time",
"work", 3L, "retired",
"work", 4L, "student",
"work", 5L, "housewife",
"work", 6L, "unemployed",
"work", 7L, "other",
"work", 8L, "kid under 15",
"sex", NA, "gender?",
"sex", 1L, "Man",
"sex", 2L, "Woman",
"pss", NA, "How often do you use PS?",
"pss", 1L, "Daily",
"pss", 2L, "several times per week",
"pss", 3L, "once per week",
"pss", 4L, "several time per month",
"pss", 5L, "Rarly"
)
I am wondering is there any way that I can combine these file together to have a one labelled dataframe like SPSS
’s style format (dbl+lbl format). I know labelled
package which can add a value label to a non labelled vector, like this example:
v <- labelled::labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, maybe = 2, no = 3))
I was hoping there is a better/faster way than adding label to each variable one by one.
Another imap_dfc
solution:
library(tidyverse)
df %>% imap_dfc(~{
label[label$variable==.y,c('label','value')] %>%
deframe() %>% # to named vector
haven::labelled(.x,.)
})
# A tibble: 10 x 7
id House_member dob age_quota work sex pss
<int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl> <dbl+lbl> <dbl+lbl>
1 1 4 [4 persons] 1983 2 [15-19] 2 [part time] 1 [Man] 1 [Daily]
2 2 1 [1 person] 1940 7 [70 +] 2 [part time] 1 [Man] 2 [several times per week]
3 3 2 [2 persons] 1951 5 [40-49] 6 [unemployed] 1 [Man] 1 [Daily]
4 4 4 [4 persons] 1965 2 [15-19] 2 [part time] 1 [Man] 4 [several time per month]
5 5 3 [3 persons] 1965 2 [15-19] 3 [retired] 1 [Man] 1 [Daily]
6 6 1 [1 person] 1951 3 [20-29] 1 [full time] 1 [Man] 3 [once per week]
7 7 1 [1 person] 1955 1 [10-14] 1 [full time] 1 [Man] 3 [once per week]
8 8 4 [4 persons] 1982 2 [15-19] 2 [part time] 2 [Woman] 5 [Rarly]
9 9 2 [2 persons] 1990 2 [15-19] 4 [student] 2 [Woman] 3 [once per week]
10 10 2 [2 persons] 1953 3 [20-29] 2 [part time] 2 [Woman] 4 [several time per month]
Used tibble::deframe
and haven::labelled
which are included in tidyverse
Speed comparison after replacing filter
/select
by direct access to label
:
Waldi <- function() {
df %>% imap_dfc(~{
label[label$variable==.y,c('label','value')] %>%
deframe() %>% # to named vector
haven::labelled(.x,.)})}
Waldi_old <- function() {
df %>% imap_dfc(~{
label %>% filter(variable==.y) %>%
select(label, value) %>%
deframe() %>% # to named vector
haven::labelled(.x,.)
})}
#EDIT : Included TIC33() for-loop solution
microbenchmark::microbenchmark(TIC3(),Waldi(),Anil(),TIC1(),Waldi_old(),Sinh())
Unit: microseconds
expr min lq mean median uq max neval cld
TIC3() 688.0 871.80 982.280 920.95 1005.55 1801.6 100 a
Waldi() 1345.5 1543.60 1804.758 1635.45 1893.75 4306.8 100 b
Anil() 4006.8 4476.65 5188.519 4862.95 5439.10 10163.6 100 c
TIC1() 3898.2 4278.80 5009.927 4774.95 5277.05 12916.2 100 c
Waldi_old() 18712.3 20091.75 21756.140 20609.35 22169.75 33359.8 100 d
Sinh() 22730.9 24093.45 25931.412 24946.00 26614.00 38735.3 100 e