I have a dataframe `train` (21 predictors, 1 response, 1012 observations), and I suspect that the response is a nonlinear function of the predictors. Thus, I would like to perform a multivariate polynomial regression of the response on all the predictors, and then try to understand which terms are the most important. To avoid the collinearity problems of standard multivariate polynomial regression, I'd like to use multivariate orthogonal polynomials with `polym()`. However, I have quite a lot of predictors, and their names do not follow a simple rule. For example, in `train` I have predictors named `X2`, `X3` and `X5`, but not `X1` and `X4`. The response is `X14`. Is there a way to write the formula in `lm` without having to explicitly write the names of all the predictors? Writing

    OrthoModel=lm(X14~polym(.,2),data=train)

returns the error

    Error in polym(., 2) : object '.' not found
EDIT: the model I wanted to fit contains about 3.5 billion terms, so it's useless. It's better to fit a model with only main effects, interactions and second-degree terms (231 terms). I wrote the formula for a standard (non-orthogonal) second-degree polynomial:

    as.formula(paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", collapse=""))

where `Xtrain` is obtained from `train` by deleting the response column `X14`. However, when I try to express the polynomial in an orthogonal basis, I get a parse error:
    as.formula(
        paste(" X14 ~ (", paste0(names(Xtrain), collapse="+"), ")^2", "+",
              paste("poly(", paste0(names(Xtrain), ", degree=2)", collapse="+"),
                    collapse="")
        )
    )
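For reference, this is the string that the second `paste` builds (shown here with just the three predictor names mentioned above as a stand-in for the full `names(Xtrain)`): only the first name gets a `poly(` prefix, so the commas after the other names end up outside any call and the text cannot be parsed.

    nms <- c("X2", "X3", "X5")   # stand-in for names(Xtrain)
    paste("poly(", paste0(nms, ", degree=2)", collapse="+"), collapse="")
    # [1] "poly( X2, degree=2)+X3, degree=2)+X5, degree=2)"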
There are a couple of problems with that approach, one of which you already see. But even if the dot could be expanded within `polym`, you would still have faced an error when it came time for the `2` to be evaluated, because `degree` is a parameter that comes after the "dots" (`...`) in the `polym` argument list, and it therefore must be supplied as a named argument rather than just positionally offered.
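To see that second point concretely (`x` and `y` are just made-up vectors here):

    x <- rnorm(20); y <- rnorm(20)
    polym(x, y, 2)           # fails: the bare 2 is captured by ... as a (length-one) third variable
    polym(x, y, degree = 2)  # works: degree has to be passed by name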
An approach using `as.formula` succeeds. Here it is with the `Orthodont` dataframe from pkg:nlme (although using `Sex` as the dependent variable is statistical nonsense). I took the `Subject` column out of the data and also took `Sex` out of the names passed to `paste`:
    data(Orthodont, package="nlme")
    # columns 3:4 of Orthodont are Subject and Sex, so both are dropped from the
    # predictor names; the data itself only needs Subject (column 3) removed
    lm( as.formula( paste("Sex~polym(",
                          paste(names(Orthodont[-(3:4)]), collapse=","),
                          ",degree=2)")),
        data=Orthodont[-3])
    Call:
    lm(formula = as.formula(paste("Sex~polym(", paste(names(Orthodont[-(3:4)]),
        collapse = ","), ",degree=2)")), data = Orthodont[-3])

    Coefficients:
                             (Intercept)  polym(distance, age, degree = 2)1.0
                                  1.4433                              -2.5849
    polym(distance, age, degree = 2)2.0   polym(distance, age, degree = 2)0.1
                                  0.4651                               1.3353
    polym(distance, age, degree = 2)1.1   polym(distance, age, degree = 2)0.2
                                 -7.6514
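The trailing digits in the coefficient names give the degree of each variable in that term, in the order the variables were passed to `polym`: `1.0` is linear in `distance`, `0.1` linear in `age`, `1.1` their interaction, `0.2` the quadratic in `age`, and so on. The same labels show up as the column names of the basis itself:

    colnames(polym(Orthodont$distance, Orthodont$age, degree = 2))
    # [1] "1.0" "2.0" "0.1" "1.1" "0.2"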
Formula objects can be created from text input with `as.formula`; this is essentially an application of the last example in `?as.formula`.
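Transferred to the data in the question, the same construction would look something like the sketch below (untested, since `train` is not available here; it assumes the response column is named `X14` as described). With 21 predictors and `degree = 2`, `polym` should produce the 21 main effects, 210 pairwise interactions and 21 squared terms in an orthogonal basis, i.e. the 231-term model from the edit plus the squares.

    pred_names <- setdiff(names(train), "X14")   # the 21 predictor columns
    OrthoModel <- lm(
        as.formula(paste("X14 ~ polym(",
                         paste(pred_names, collapse=","),
                         ", degree=2)")),
        data = train)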