Read the column specification col_types of readr::read_delim from file


How can I read the column specification col_types of the readr::read_delim function from a file?

Instead of

> read_csv(file = I('varInt,varChar,varFac\n
+                    1,a,A1\n
+                    2,b,A2\n
+                    3,c,A3'),
+          col_types = cols(varInt = 'i',
+                           varChar = 'c',
+                           varFac = col_factor(levels = c('A1', 'A2', 'A3'))))
# A tibble: 3 × 3                                                                                                                                                             
  varInt varChar varFac
   <int> <chr>   <fct> 
1      1 a       A1    
2      2 b       A2    
3      3 c       A3     

I want to do something like

mySpecFile <- read_csv(file = I("Variable,Spec\n
                                 varInt,i\n
                                 varChar,c\n
                                 varFac,col_factor(levels = c('A1'; 'A2'; 'A3'))"))

mySpec <- mySpecFile |> pull(Spec, Variable) |> as.list()

read_csv(file = I('varInt,varChar,varFac\n
                   1,a,A1\n
                   2,b,A2\n
                   3,c,A3'),
         col_types = mySpec)

But this throws: Error: Unknown shortcut: col_factor(levels = c('A1'; 'A2'; 'A3'))

So, specifying levels of factors does not work for me.

Seems to be related: R readr col_types specified in a metadata file, specifically using custom date formats

However, the readr::read_delim documentation says

One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be inferred from guess_max rows of the input, interspersed throughout the file. This is convenient (and fast), but not robust. If the guessed types are wrong, you'll need to increase guess_max or supply the correct types yourself.

Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().

Alternatively, you can use a compact string representation where each character represents one column:


Solution

  • A few things:

    library(readr)
    library(dplyr)  # for pull()
    mySpecFile <- read_csv2(file = I("Variable;Spec\n
                                     varInt;i\n
                                     varChar;c\n
                                     varFac;col_factor(levels = c('A1', 'A2', 'A3'))"))
    # ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
    # Rows: 3 Columns: 2
    # ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    # Delimiter: ";"
    # chr (2): Variable, Spec
    # ℹ Use `spec()` to retrieve the full column specification for this data.
    # ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    mySpec <- mySpecFile |>
      pull(Spec, Variable) |>
      as.list() |>
      lapply(function(z) if (nchar(z) > 1) tryCatch(eval(parse(text = z)), error = function(e) z) else z)
    read_csv(file = I('varInt,varChar,varFac\n
                       1,a,A1\n
                       2,b,A2\n
                       3,c,A3'),
             col_types = mySpec)
    # # A tibble: 3 × 3
    #   varInt varChar varFac
    #    <int> <chr>   <fct> 
    # 1      1 a       A1    
    # 2      2 b       A2    
    # 3      3 c       A3    
    

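    The if (nchar(z) > 1) guard keeps the single-character shortcuts out of eval(): evaluating "c" (the shortcut for character) would return the base R function c instead of the string, and other shortcuts such as "T", "t" and "D" also name existing R objects. If you want more specificity, change that conditional to something stricter.

    The tryCatch(.., error = function(e) z) ensures that a string which fails to parse or evaluate as an R expression is returned unchanged.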
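
    To make the two safeguards concrete, here is a small sketch of how each one behaves on its own. The name guard is only a label for the anonymous function used above, and the unbalanced string is a made-up input; the results assume a session in which c has not been redefined.

```r
library(readr)

# Same logic as the anonymous function passed to lapply() above,
# given a (hypothetical) name so it can be called directly.
guard <- function(z) {
  if (nchar(z) > 1) tryCatch(eval(parse(text = z)), error = function(e) z) else z
}

# Without the nchar() check, the shortcut "c" evaluates to base::c
is.function(eval(parse(text = "c")))        # TRUE

# With the check, single-character shortcuts pass through as strings
guard("c")                                  # "c"

# A string that is valid R code becomes a readr collector object
inherits(guard("col_integer()"), "collector")

# A string that cannot be parsed falls back to the original text
guard("col_date(format = ")
```

    Anything that falls back to its original string is still handed to readr, which will then report it as an unknown shortcut if it is not one.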

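    As an alternative to ;-delimited text, we can keep the comma-delimited spec file and wrap the fields in double quotes (all of them, or just the one that needs it) so that the commas inside col_factor(...) are not treated as delimiters.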

    mySpecFile <- read_csv(file = I("Variable,Spec\n
                                     varInt,i\n
                                     varChar,c\n
                                     varFac,\"col_factor(levels = c('A1', 'A2', 'A3'))\""))
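
    For completeness, the quoted variant can be carried through to the final read as sketched below. This is one possible arrangement, not the only one: parse_spec is a hypothetical helper that swaps the nchar() guard for a stricter check (only strings that look like a call to a col_*() constructor are evaluated), and setNames() stands in for pull() so that dplyr is not needed. The pattern check is a sanity check, not a sandbox, so the spec file still has to come from a trusted source.

```r
library(readr)

# Spec file with the factor spec protected by double quotes.
# col_types = "cc" reads both columns as character without guessing.
mySpecFile <- read_csv(
  I("Variable,Spec\nvarInt,i\nvarChar,c\nvarFac,\"col_factor(levels = c('A1', 'A2', 'A3'))\"\n"),
  col_types = "cc"
)

# Hypothetical helper: convert one spec string into something cols() accepts.
parse_spec <- function(z) {
  # Single-character shortcuts ("i", "c", "d", ...) are passed through as-is
  if (nchar(z) == 1) return(z)
  # Refuse anything that does not look like col_xxx(...)
  if (!grepl("^col_[a-z_]+\\(.*\\)$", z)) {
    stop("Unrecognised column spec: ", z)
  }
  eval(parse(text = z))
}

# Named list: names are the column names, values are shortcuts or collectors
mySpec <- lapply(setNames(as.list(mySpecFile$Spec), mySpecFile$Variable), parse_spec)

dat <- read_csv(I("varInt,varChar,varFac\n1,a,A1\n2,b,A2\n3,c,A3"),
                col_types = mySpec)
```

    Unlike the tryCatch() fallback, this version stops with an error on a spec it does not recognise, which surfaces typos in the spec file earlier.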