rmachine-learningpredictmultivalue

How to reliably use a matrix (multivariable) response in R using a formula?


I am trying to use a multivariable response in R and came across the damned formulas, and having all sorts of unexpected behaviors, mainly when using them inside functions and packages. This question is twofold

I can input a multivariable response and be able to use the model to predict afterwards?

This MVE, using rpart package, serves as an example. Here y is a two-column matrix (response), and I want to predict y using x, i.e. each column in x (two columns in this MVE). Note that the method itself and what each column in y mean is irrelevant, this is just a MVE to reproduce the problem:

library(rpart)

set.seed(1)
y <- rpois(10, lambda = 1.25)
y <- cbind(c(1,4,10,11,12, 14,16,17,20, 21), y)
print(y)
x <- matrix(1:20, ncol = 2) # just two dummy predictors
print(x)

mymodel <- rpart(y ~ x, method = "poisson", minbucket = 1)

newx <- matrix(11:20, ncol = 2) # just some dummy test predictors, note that we have less rows
predict(mymodel, newdata = data.frame(newx))
# output:
        1          2          3          4          5          6          7          8          
 9         10 
 0.12500000 0.12500000 0.12500000 0.20000000 0.04761905 0.17948718 0.17948718 0.11538462 
 0.04000000 0.04000000 
 Warning message:
 'newdata' had 5 rows but variables found have 10 rows 

As you can see, I can't predict a new dataset. I have been messing with the column and row names and have not been able to make it work.

In addition,

How can I make a wrapper that is "safe"?

For example, in this MVE:

mywrapper <- function(y, x){
  mymodel <- rpart(y ~ x, method = "poisson", minbucket = 1)
  
  return(mymodel)
}

and provided the help provided in the R documentation:

A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument. Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.

I don't really understand what this is supposed to mean. As far as I understand it, not inputing nor y or x to mywrapper() will result in an error (this is the expected behavior). I ask because I am dealing with r packages and functions inside packages and I want to make sure that there is no unexpected behavior with the formulas.


Solution

  • Generally speaking formulas in R work well with dataframes. rpart works on matrices, and while dataframes can hold matrices, they tend to get converted to separate columns. To avoid this, wrap the matrix in I():

    # Same as your code to start...then this:
    
    predict(mymodel, newdata = data.frame(x = I(newx)))
    #>    1    2    3    4    5 
    #> 0.04 0.04 0.04 0.04 0.04
    

    In the second part of your question, you are creating a formula in the mywrapper function, so that's where it will look for variables if they aren't contained in the newdata dataframe. "Environments" in R are similar to "stack frames" in other languages; the main difference is that environments have a single parent and searches proceed there if the object isn't found in the original.

    Generally speaking the parent is not the frame of the caller, it is the frame where the environment was created, or something specially listed as the parent.

    So what happens if you run predict on the returned value from mywrapper is that it looks at the formula to find what variables it needs. Only the variables on the right hand side are needed for predictions, so that's just x. If you supply x in your newdata argument to predict, everything will be fine and proceed as before, but if you don't, things are different.

    Since x was not found in the newdata, it goes to the environment of the formula. That's the evaluation frame of mywrapper, and it will see x there, since it was an argument to that function.

    If it was looking for z instead, it wouldn't find it there. The next place to look is the parent environment, which is the one in effect when mywrapper was created, i.e. the global environment. If there's no z there, it would search through the chain of environments listed by search(), which are typically package exports. The search() list is chained together so that each entry is the parent of the one before.

    I hope this isn't too much information....