rlinear-regressionpolynomialsorthogonal

R: multivariate orthogonal regression without having to write the variable names explicitly


I have a dataframe train (21 predictors, 1 response, 1012 observations), and I suspect that the response is a nonlinear function of the predictors. Thus, I would like to perform a multivariate polynomial regression of the response on all the predictors, and then try to understand which are the most important terms. To avoid the collinearity problems of standard multivariate polynomial regression, I'd like to use multivariate orthogonal polynomials with polym(). However, I have quite a lot of predictors, and their names do not follow a simple rule. For example, in train I have predictors named X2,X3 and X5, but not X1 and X4. The response is X14. Is there a way to write the formula in lm without having to explicitly write the name of all predictors? Writing

OrthoModel=lm(X14~polym(.,2),data=train)

returns the error

Error in polym(., 2) : object '.' not found

EDIT: the model I wanted to fit contains about 3.5 billion terms, so it's useless. It's better to fit a term with only main effects, interactions and second degree terms -> 231 terms. I wrote the formula for a standard (non-orthogonal) second degree polynomial:

`as.formula(paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", collapse=""))` 

where Xtrain is obtained by train by deleting the response column X14. However, when I try to express the polynomial in an orthogonal basis, I get a parse text error:

    as.formula( 
         paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", "+", 
               paste( "poly(", paste0(names(Xtrain), ", degree=2)", 
                      collapse="+"), 
               collapse="")
     )
     
 )

Solution

  • There are a couple of problems with that approach, one of which you already see but even if the dot could be expanded within polym you would still have faced an error when it came time for the 2 to be evaluated, because degree is a parameter after the "dots" in the polym argument list and it therefore must be supplied as a named parameter rather than just positionally offered.

    An approach using as.formula succeeds (with the 'Orthodont' dataframe in pkg:nlme (although using 'Sex' as the dependent variable is statistically nonsense). I took out the "Subject" column from the data and also took out the "Sex" from the names passed to paste:

    data(Orthodont, package="nlme")
    lm(   as.formula( paste("Sex~polym(" ,
                            paste(names(Orthodont[-(3:4)]), collapse=","),",degree=2)")), 
          data=Orthodont[-3])
    
    Call:
    lm(formula = as.formula(paste("Sex~polym(", paste(names(Orthodont[-(3:4)]), 
        collapse = ","), ",degree=2)")), data = Orthodont[-3])
    
    Coefficients:
                            (Intercept)  polym(distance, age, degree = 2)1.0  
                                 1.4433                              -2.5849  
    polym(distance, age, degree = 2)2.0  polym(distance, age, degree = 2)0.1  
                                 0.4651                               1.3353  
    polym(distance, age, degree = 2)1.1  polym(distance, age, degree = 2)0.2  
                                -7.6514      
    

    Formula objects can be created from text input with as.formula. This is essentially an application of the last example in ?as.formula.