
Unexpected behaviour of "formula" with "data.table" in R


I am trying to build a formula dynamically for use in dynlm. I ran into behaviour of the formula function that I do not understand, as the following code shows:

library(data.table)
dt_test <- data.table("a"=rnorm(10), "b"=1:5)

dt_test[, .(.(
   formula("z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2")
 )), .(b)]

The code above is expected to produce (identical) formulas for each value of b. This formula is enclosed in .(.(...)) to return a list, just so that it can be properly stored in a column from the original data.table.

However, the formula returned does not match the string originally provided: a comma appears between the + and tt, as you can see from the output:

       b                                                                           V1
   <int>                                                                       <list>
1:     1 z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + ,    tt + tt2
2:     2 z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + ,    tt + tt2
3:     3 z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + ,    tt + tt2
4:     4 z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + ,    tt + tt2
5:     5 z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + ,    tt + tt2

Essentially, it adds a comma where there is none. It does so even if I rearrange the terms of the sum, but it stops if I remove q_val, for example. The same happens with as.formula.

I would like to understand what is going on and avoid it.


Solution

  • This is just a cosmetic printing issue due to the way R treats long formulas:

    If you run:

    formula(paste0("z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2"))
    

    You will see that R prints it on two lines by default, breaking just before "tt + tt2" (no matter how wide the console is):

    #z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + 
    #    tt + tt2
    

    The break is part of how R turns the formula back into text for display. If you run deparse on it, you get a character vector of length 2, one element per displayed line:

    deparse(formula(paste0("z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2")))
    
    # [1] "z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + "
    # [2] "    tt + tt2"  
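
To check the line-break explanation directly, the sketch below deparses the same formula with the default cutoff and with a wider one. It uses only base R. `width.cutoff` is a documented argument of `deparse`; the variable names `f` and `one_line` are made up for illustration.

```r
# Same formula text as in the question
f <- formula("z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2")

# With the default cutoff the text is split into two pieces
length(deparse(f))

# A wider cutoff (500 is the documented maximum) keeps it in one piece
one_line <- deparse(f, width.cutoff = 500L)
length(one_line)
one_line
```

If the single-string result matches the text you passed in, the formula object itself was never altered; only its default text representation is split.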
    

    However, if you assign the result of your original code to dt_formulas, you will see that the formula is stored intact:

    dt_formulas <- dt_test[, .(.(
      formula("z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2")
    )), .(b)]
    
    # Inspect the list column (second column) that holds the formulas
    dt_formulas[[2]]
    
    # [[1]]
    # z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 +
    #   tt + tt2
    # <environment: 0x7fa96ff6ffd8>
    #   
    # [[2]]
    # z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 +
    #   tt + tt2
    # <environment: 0x7fa96ff6ffd8>
    # ....
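
If the goal is a table that prints cleanly, one possible approach is to keep the formulas in the list column and add a plain character column for display. This is only a sketch: `formula_text` is a hypothetical helper, the column names `f` and `f_text` are made up here, and `b` is written out with `rep()` so the example does not rely on recycling.

```r
library(data.table)

f_txt <- "z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2"
dt_test <- data.table(a = rnorm(10), b = rep(1:5, times = 2))

# One formula per group, stored in a list column named f
dt_formulas <- dt_test[, .(f = .(formula(f_txt))), by = b]

# Hypothetical helper: deparse with a wide cutoff so the text stays in one piece
formula_text <- function(f) paste(deparse(f, width.cutoff = 500L), collapse = " ")

# Character column for display; the formula objects in f are left untouched
dt_formulas[, f_text := vapply(f, formula_text, character(1))]
dt_formulas[, .(b, f_text)]
```

The list column can still be handed to a modelling function, while the character column is the one to print or log.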
    

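    As you mentioned, this is also why you don't see the comma if you remove some of the variables in the formula code. It has nothing to do with which term you remove; you are simply shortening the formula enough to avoid the automatic line break. The comma itself most likely comes from data.table's print method, which appears to join the two deparsed pieces of each list element with a comma when it builds the table display.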
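
The length argument can be tried out in base R alone. The sketch below compares the original formula with the variant that drops q_val, then joins the deparsed pieces with a comma. That last step is an assumption about how the table display is built, not something confirmed here; it is included because it yields the same text as the printed column. The names `long_f`, `short_f` and `joined` are illustrative.

```r
long_f  <- formula("z_val ~ s_val + q_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2")
short_f <- formula("z_val ~ s_val + L(s_dval, 32:0) + L(q_dval, 2:0) + 1 + tt + tt2")

# The long formula is deparsed into two pieces, the shorter one into a single piece
length(deparse(long_f))
length(deparse(short_f))

# Assumption: joining the pieces with a comma mimics the table display
joined <- paste(deparse(long_f), collapse = ",")
joined
```

With a single piece there is nothing to join, which fits the observation that the comma disappears once the formula is short enough.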