I have a data set with a binary response (Yes/No) and a continuous predictor X, and I'm trying to build a model that classifies Yes/No from X.
In my data set, when X = 0.5, 48% of the observations are Yes. However, I know the true probability of Yes should be exactly 50% when X = 0.5. When I fit a logistic regression, the predicted probability at X = 0.5 is not 0.5.
How can I correct this? I suspect all the predicted probabilities will be slightly off if the curve does not pass through the correct point.
Is it valid simply to add extra observations to my sample to adjust the proportion?
It does not have to be logistic regression; LDA, QDA, etc. are also of interest.
I have searched Stack Overflow, but only found topics regarding linear regression.
I believe that in R (assuming you're using `glm` from base R) you just need

```r
glm(y ~ I(x - 0.5) - 1, data = your_data, family = binomial)
```

The `I(x - 0.5)` recenters the covariate at 0.5, and the `-1` suppresses the intercept, so the linear predictor is 0 at x = 0.5, which forces the predicted probability to be exactly 0.5 there.
For example:

```r
set.seed(101)
dd <- data.frame(x = runif(100, 0.5, 1), y = rbinom(100, size = 1, prob = 0.7))
m1 <- glm(y ~ I(x - 0.5) - 1, data = dd, family = binomial)
predict(m1, type = "response", newdata = data.frame(x = 0.5))  ## 0.5
```
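The same trick generalizes if the known probability at x = 0.5 were some value other than 0.5: instead of suppressing the intercept, fix it at `qlogis(p0)` via an offset term (a sketch, using a purely hypothetical target of p0 = 0.7 with the same simulated data):

```r
set.seed(101)
dd <- data.frame(x = runif(100, 0.5, 1), y = rbinom(100, size = 1, prob = 0.7))

p0 <- 0.7               ## hypothetical known probability at x = 0.5
dd$off <- qlogis(p0)    ## fixed intercept on the logit scale

## -1 drops the free intercept; offset(off) pins it at qlogis(p0),
## so the fitted curve passes through (0.5, p0) exactly
m2 <- glm(y ~ I(x - 0.5) - 1 + offset(off), data = dd, family = binomial)

predict(m2, type = "response",
        newdata = data.frame(x = 0.5, off = qlogis(p0)))  ## p0 exactly
```

Putting the offset inside the formula (rather than via `glm`'s `offset` argument) means `predict` picks it up from `newdata` automatically.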