I have the following data, the response variable which is categorical is gleason_score, while the predictor, a continuous variable is age. I have fitted the data using glm
function to have a logistic regression model.
structure(list(age = c(69L, 60L, 78L, 73L, 80L, 78L, 89L, 75L,
66L, 74L, 72L, 80L, 63L, 100L, 67L, 73L, 75L, 83L, 72L, 73L,
50L, 75L, 70L, 56L, 75L, 70L, 90L, 65L, 70L, 80L, 73L, 70L, 70L,
75L, 71L, 65L, 65L, 72L, 67L, 65L, 70L, 75L, 85L, 75L, 70L, 86L,
74L, 78L, 64L, 70L, 65L, 65L, 70L, 74L, 77L, 75L, 65L, 80L, 70L,
70L, 58L, 58L, 65L, 78L, 76L, 80L, 66L, 71L, 70L, 55L, 70L, 90L,
78L, 67L, 65L, 60L, 69L, 80L, 72L, 76L, 68L, 77L, 88L, 69L, 79L,
77L, 78L, 66L, 80L, 72L, 81L, 80L, 70L, 86L, 87L, 70L, 80L, 66L,
60L, 50L, 69L, 63L, 75L, 68L, 68L, 75L, 63L, 74L, 54L, 81L, 72L,
70L, 68L, 55L, 75L, 75L, 65L, 72L, 77L, 64L, 64L, 76L, 83L, 95L,
85L, 70L, 75L, 75L, 61L, 95L, 72L, 81L, 87L, 70L, 77L, 70L, 65L,
77L, 70L, 65L, 70L, 75L, 68L, 93L, 65L, 65L, 75L, 78L, 86L),
gleason_score = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 2L, 2L), .Label = c("0", "1"), class = "factor")), class = "data.frame", row.names = c(NA,
-149L))
Using the following code:
attach(data2)
glm.fit=glm(gleason_score ~ age, family=binomial(link = "logit"))
plot(x=age, y=gleason_score)
lines(age, glm.fit$fitted.values)
summary(glm.fit)
I have these results:
Call:
glm(formula = gleason_score ~ age, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1515 0.4451 0.5806 0.6530 1.0105
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.17146 1.87588 -1.158 0.2470
age 0.05155 0.02651 1.944 0.0518 .
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 141.02 on 148 degrees of freedom
Residual deviance: 137.00 on 147 degrees of freedom
AIC: 141
Number of Fisher Scoring iterations: 4
I use the following code to build the equation:
equatiomatic::extract_eq(glm.fit, wrap = FALSE,use_coefs = TRUE)
but I have this error message:
Error in model$data[which(model$y == 1)[1], outcome_nm] : object of type 'environment' is not subsettable
Kindly assist in tracing my error
Do not use attach(data2)
.. Instead, pass data2
to the data
argument of the glm()
call.
glm.fit=glm(gleason_score ~ age, data=data2, family=binomial(link = "logit"))
equatiomatic::extract_eq(glm.fit, wrap = FALSE,use_coefs = TRUE)
Output:
$$
\log\left[ \frac { \widehat{P( \operatorname{gleason\_score} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{gleason\_score} = \operatorname{1} )} } \right] = -2.17 + 0.05(\operatorname{age})
$$
To see why, compare glm.fit$data
when glm.fit
is created
attach(data2)
vs.attach
and instead passing data2
to data
argThe 2nd approach is correct.
Under the first approach (yours), glm.fit$data
returns this:
<environment: R_GlobalEnv>
Under the second approach (correct), glm.fit$data
returns the actual data (note only first six rows shown here)
age gleason_score
1 69 1
2 60 1
3 78 0
4 73 1
5 80 1
6 78 1