I'm trying to model the relationship between a categorical predictor variable and a continuous outcome variable. I use lm()
to this end. Since it's a categorical variable, the proper practice is to convert it to a factor variable type. However, when using poly()
for the predictor's regression term and when setting up the predictor variable as a factor it causes lm()
to break. On the other hand, if I run lm()
without using poly()
(but do keep the predictor as factor) or keep poly()
but not convert the predictor to a factor (let it be numeric) -- then lm()
doesn't break. I don't understand why it breaks and I don't understand if I can trust the results when it doesn't break.
Data about 50 basketball players. One column (PosCode
) is about player's position in the game, and the other (Height
) is player's height.
data <-
structure(list(Player = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50), PosCode = c(3, 3, 4, 1, 4, 1, 3,
1, 2, 2, 4, 1, 5, 5, 2, 1, 2, 5, 4, 4, 5, 4, 4, 4, 2, 3, 2, 3,
1, 1, 2, 4, 1, 2, 3, 1, 5, 4, 3, 4, 4, 1, 1, 4, 5, 1, 1, 1, 5,
2), Height = c(176.1, 179.1, 183.1, 169.7, 177.3, 179, 176.4,
174.9, 180.2, 176.5, 178.6, 167.9, 183.4, 166.2, 189.5, 171.9,
188.5, 172.6, 167.7, 172.6, 186.9, 163.8, 179.3, 165.4, 182.2,
166.1, 176.8, 171.9, 173.8, 163, 172.5, 184.9, 170.4, 170.6,
166.8, 172.6, 184.3, 163.3, 182.4, 165.8, 173.4, 182.1, 172.9,
184.9, 173.2, 185.8, 161.4, 186, 178.4, 170.7)), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
> data
## # A tibble: 50 x 3
## Player PosCode Height
## <dbl> <dbl> <dbl>
## 1 1 3 176.
## 2 2 3 179.
## 3 3 4 183.
## 4 4 1 170.
## 5 5 4 177.
## 6 6 1 179
## 7 7 3 176.
## 8 8 1 175.
## 9 9 2 180.
## 10 10 2 176.
## # ... with 40 more rows
I want to know whether I can predict players height from their position in the game. Since position is categorical (there are 5 possible positions), this variable should be of a factor type, with 5 levels.
library(tidyverse)
library(magrittr)
data %<>% mutate_at(vars(PosCode), ~ as.factor(.)) ## convert PosCode from dbl to fct
lm()
without poly()
lm(Height ~ PosCode, data = data)
## Call:
## lm(formula = Height ~ PosCode, data = data)
##
## Coefficients:
## (Intercept) PosCode2 PosCode3 PosCode4 PosCode5
## 173.6714 4.9397 0.4429 0.1824 4.1857
lm()
with poly()
lm(Height ~ poly(PosCode ,1), data = data)
## Error in qr.default(X) : NA/NaN/Inf in foreign function call (arg 1)
## In addition: Warning messages:
## 1: In mean.default(x) : argument is not numeric or logical: returning NA
## 2: In Ops.factor(x, xbar) : ‘-’ not meaningful for factors
poly()
## convert PosCode from fct back to dbl
data %<>% mutate_at(vars(PosCode), ~ as.double(.))
## lm() without poly()
lm(Height ~ PosCode, data = data)
Call:
lm(formula = Height ~ PosCode, data = data)
## Coefficients:
## (Intercept) PosCode
## 174.3848 0.3112
## lm() with poly()
lm(Height ~ poly(PosCode ,1), data = data)
## Call:
## lm(formula = Height ~ poly(PosCode, 1), data = data)
## Coefficients:
## (Intercept) poly(PosCode, 1)
## 175.256 3.173
But clearly, treating PosCode
as dbl
rather than fct
changes the model in a way that is wrong.
I don't understand why including poly()
in lm()
breaks it when the predictor is set up as a factor variable.
From help("poly")
:
x
a numeric vector at which to evaluate the polynomial.
Thus, you cannot use factors inside poly()
. This is expected, because categorical variables (i.e., factors) have to be recoded, for example, into dummy variables. And it does neither make sense to have, say, a quadratic effect for the categorical variable as a whole nor for the coded (dummy) variables. (It does not make sense from a substantive perspective, but squaring a dummy variable that has only 0s and 1s does also not make much sense from a perspective blind to statistics.)
You can see that lm()
recodes your factor because you get four coefficents (one less than the number of categories) for the variable PosCode
in your first model.
In the end, poly()
is not of much use unless you set its argument degree
to a value > 1