Context: I've read the PISA 2022 data using the haven
package and now I want to create an auxiliary df that consists of three columns:
The issue is that the label and values are accesible if I type attributes(pisa_df$EFFORT1)
, but NOT if I type attributes(pisa_df[,i])
. Why would this be, and is there a way to get around this? I have >1000 variables so typing them one by one is not an option. I've tried something like pisa_df$get(colnames(pisa_df)[i])
but of course it doesn't work.
It seems like a very newbie question but I can't even figure out how to search for possible answers. Thanks in advance!
Up front, the reason that attributes(pisa_df$EFFORT1)
works but attributes(pisa_df[,1])
does not is because of Why does subsetting a column from a data frame vs. a tibble give different results. Namely, in native R, [.data.frame
when reducing to a single column drops to a vector, but tbl_df
does not. The base [
can choose to not reduce to a vector by adding the drop=FALSE
argument.
mt <- mtcars[1:3,]
mt[,1]
# [1] 21.0 21.0 22.8
mt[,1, drop=FALSE]
# mpg
# Mazda RX4 21.0
# Mazda RX4 Wag 21.0
# Datsun 710 22.8
tibble(mt)[,1]
# # A tibble: 3 × 1
# mpg
# <dbl>
# 1 21
# 2 21
# 3 22.8
The workaround is to use $
with names or [[
with a column index,
mt[[1]]
# [1] 21.0 21.0 22.8
tibble(mt)[[1]]
# [1] 21.0 21.0 22.8
In your case, working on a SAS file takes little effort to give what you want. Using the "school questionnaire file" (it was easy for me to get), we can do something like below.
Up front, I'm demonstrating grabbing the labels and unique values for a few columns. Some of the columns are all unique (e.g., SCH
has 21,629 rows, and column CNTSCHID
has 21,629 distinct values), so I'm not certain if that is as interesting to you. Regardless, while I'm choosing a few, you can use this for all of them without problem.
Also, some of the values are character
, some are numeric
, so we must either convert all numbers to strings, or we have two separate columns. I'll choose the latter for demonstration, as I think converting all to string would be simpler for you to adapt yourself.
SCH <- haven::read_sas(unz("SCH_QQQ_SAS.zip", "cy08msp_sch_qqq.sas7bdat"))
library(dplyr)
columns <- c(1, 3, 4, 5)
quux <- lapply(columns, function(ind) {
out <- tibble(column = names(SCH)[ind], label = attributes(SCH[[ind]])$label)
if (is.character(SCH[[ind]])) {
cbind(out, tibble(values_chr = unique(SCH[[ind]])))
} else cbind(out, tibble(values_num = unique(SCH[[ind]])))
}) |>
bind_rows() |>
tibble()
quux
# # A tibble: 21,794 × 4
# column label values_chr values_num
# <chr> <chr> <chr> <dbl>
# 1 CNT Country code 3-character ALB NA
# 2 CNT Country code 3-character QAZ NA
# 3 CNT Country code 3-character ARG NA
# 4 CNT Country code 3-character AUS NA
# 5 CNT Country code 3-character AUT NA
# 6 CNT Country code 3-character BEL NA
# 7 CNT Country code 3-character BRA NA
# 8 CNT Country code 3-character BRN NA
# 9 CNT Country code 3-character BGR NA
# 10 CNT Country code 3-character KHM NA
# # ℹ 21,784 more rows
# # ℹ Use `print(n = ...)` to see more rows
The notion is that if a particular column is character
, then you would use values_chr
(for whatever work you're doing). If you choose only character
columns, then you can forego the if
/else
and just put out values
of the distinct strings.
This can be done without dplyr
if needed, with just a little more effort.