rvectorr-haven

Why isn't df$VARNAME the same as df[,i] when reading data using the R haven package?


Context: I've read the PISA 2022 data using the haven package and now I want to create an auxiliary df that consists of three columns:

The issue is that the label and values are accesible if I type attributes(pisa_df$EFFORT1), but NOT if I type attributes(pisa_df[,i]). Why would this be, and is there a way to get around this? I have >1000 variables so typing them one by one is not an option. I've tried something like pisa_df$get(colnames(pisa_df)[i]) but of course it doesn't work.

It seems like a very newbie question but I can't even figure out how to search for possible answers. Thanks in advance!


Solution

  • Up front, the reason that attributes(pisa_df$EFFORT1) works but attributes(pisa_df[,1]) does not is because of Why does subsetting a column from a data frame vs. a tibble give different results. Namely, in native R, [.data.frame when reducing to a single column drops to a vector, but tbl_df does not. The base [ can choose to not reduce to a vector by adding the drop=FALSE argument.

    mt <- mtcars[1:3,]
    mt[,1]
    # [1] 21.0 21.0 22.8
    mt[,1, drop=FALSE]
    #                mpg
    # Mazda RX4     21.0
    # Mazda RX4 Wag 21.0
    # Datsun 710    22.8
    tibble(mt)[,1]
    # # A tibble: 3 × 1
    #     mpg
    #   <dbl>
    # 1  21  
    # 2  21  
    # 3  22.8
    

    The workaround is to use $ with names or [[ with a column index,

    mt[[1]]
    # [1] 21.0 21.0 22.8
    tibble(mt)[[1]]
    # [1] 21.0 21.0 22.8
    

    In your case, working on a SAS file takes little effort to give what you want. Using the "school questionnaire file" (it was easy for me to get), we can do something like below.

    Up front, I'm demonstrating grabbing the labels and unique values for a few columns. Some of the columns are all unique (e.g., SCH has 21,629 rows, and column CNTSCHID has 21,629 distinct values), so I'm not certain if that is as interesting to you. Regardless, while I'm choosing a few, you can use this for all of them without problem.

    Also, some of the values are character, some are numeric, so we must either convert all numbers to strings, or we have two separate columns. I'll choose the latter for demonstration, as I think converting all to string would be simpler for you to adapt yourself.

    SCH <- haven::read_sas(unz("SCH_QQQ_SAS.zip", "cy08msp_sch_qqq.sas7bdat"))
    library(dplyr)
    columns <- c(1, 3, 4, 5)
    quux <- lapply(columns, function(ind) {
      out <- tibble(column = names(SCH)[ind], label = attributes(SCH[[ind]])$label)
      if (is.character(SCH[[ind]])) {
        cbind(out, tibble(values_chr = unique(SCH[[ind]]))) 
      } else cbind(out, tibble(values_num = unique(SCH[[ind]])))
    }) |>
      bind_rows() |>
      tibble()
    quux
    # # A tibble: 21,794 × 4
    #    column label                    values_chr values_num
    #    <chr>  <chr>                    <chr>           <dbl>
    #  1 CNT    Country code 3-character ALB                NA
    #  2 CNT    Country code 3-character QAZ                NA
    #  3 CNT    Country code 3-character ARG                NA
    #  4 CNT    Country code 3-character AUS                NA
    #  5 CNT    Country code 3-character AUT                NA
    #  6 CNT    Country code 3-character BEL                NA
    #  7 CNT    Country code 3-character BRA                NA
    #  8 CNT    Country code 3-character BRN                NA
    #  9 CNT    Country code 3-character BGR                NA
    # 10 CNT    Country code 3-character KHM                NA
    # # ℹ 21,784 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    The notion is that if a particular column is character, then you would use values_chr (for whatever work you're doing). If you choose only character columns, then you can forego the if/else and just put out values of the distinct strings.

    This can be done without dplyr if needed, with just a little more effort.