I am getting a strange output using tab_model() after fitting a beta regression model that does not seem to make sense. The intercept and the coefficients dont seem to correspond to the plot. For example the CI for soil_moisture seems way to high. Also, shouldnt there be two intercepts since I have to regression lines?
What am I missing?
This is my data:
df <- structure(list(weed_coverage = c(0.002, 0.01, 0.015, 0.01, 0.001,
0.03, 0.006, 0.012, 0.03, 0.01, 0.002, 0.05, 0.004, 0.02, 0.02,
0.006, 0.03, 0.01, 0.015, 0.01), soil_moisture = c(0.1125, 0.1343,
0.1662, 0.3402, 0.2195, 0.1923, 0.2277, 0.2577, 0.148, 0.2715,
0.104, 0.1495, 0.2788, 0.3477, 0.1835, 0.3175, 0.134, 0.3488,
0.3477, 0.1097), distance = structure(c(2L, 2L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("2",
"5"), class = "factor"), wc_pr = c(`1` = 0.0144421159096612,
`2` = 0.0148173851570077, `3` = 0.0146600960637327, `4` = 0.0188698067335207,
`5` = 0.0132256975788894, `6` = 0.0139395025623511, `7` = 0.0130176686719618,
`8` = 0.0171297102952414, `9` = 0.0150581171360966, `10` = 0.0119600057973879,
`11` = 0.0142983464947494, `12` = 0.0150847074475541, `13` = 0.0117921770637613,
`14` = 0.019036340203918, `15` = 0.0141784723499839, `16` = 0.0109405819057696,
`17` = 0.0148121562892363, `18` = 0.0190608859962305, `19` = 0.0103185336737258,
`20` = 0.0163480105406738)), class = "data.frame", row.names = c(NA,
-20L))
This is my code:
library(sjPlot)
library(betareg)
library(ggplot2)
betareg (weed_coverage ~ soil_moisture * distance, data = df) -> model_b # fit beta regression model
tab_model(model_b)
df %>% mutate(wc_pr= predict(model_b , type = "response")) -> df # create column with prediction values for weed_coverage
ggplot(df, aes(x = soil_moisture, y = weed_coverage, color = distance)) + # Plot the model
geom_point(size = 2, shape = 21) +
geom_line(aes(y = wc_pr, color = distance), data = df)+
theme_bw()
The output of tab_model(model_b):
This is the plot I get:
Okay, so there are a number of things happening here that I'm going to go through. The first is the easiest, which is to clarify that this model only has one intercept, which is the estimated value of weed_coverage
when your independent variables are equal to zero. For this reason, these values are not always directly interpretable.
The two regression lines on your plot are the predicted slope of soil_moisture
at the two different values of distance. This is because there is an interaction, so the slope of the line for soil_moisture
is dependent on the level of distance
.
Now, the CIs are being correctly reported by tab_model()
, but the default behaviour of most functions in the sjPlot package is to transform values on the log scale using the exp()
function, to get something closer to the probability values. If you want the original model values in the log space where the model was fit, you have to specify transform=NULL
like this:
tab_model(model_b,
transform = NULL)
You will now notice that the model is reporting the same numbers as if you call the following:
summary(model_b)
Call:
betareg(formula = weed_coverage ~ soil_moisture * distance, data = df)
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-2.2157 -0.5334 -0.0546 0.8254 1.7902
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.8822 0.7840 -4.952 7.35e-07 ***
soil_moisture -1.9591 3.3122 -0.591 0.554
distance5 -0.4752 0.9303 -0.511 0.610
soil_moisture:distance5 3.1533 3.9372 0.801 0.423
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 102.30 34.72 2.946 0.00322 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 65.73 on 5 Df
Pseudo R-squared: 0.05632
Number of iterations: 399 (BFGS) + 6 (Fisher scoring)
The CIs are very large, but that is because all of the values for weed_coverage
are very low. In the log-odds space, small percentage or proportion changes become very large when they are close to 0 or 1. In this example model, the model is very uncertain, which is why the back-transformed CIs on the probability scale are so large. It is also worth noting that these estimates are for the change in weed_coverage
for a change in 1 of soil_moisture
. If you scaled this variable, the model would not change but the numbers would be easier to interpret in the context of your data, as instead of estimating the uncertainty in the regression line far beyond the confines of your data, everything will be reporting for a change of 1 standard deviation of soil_moisture
:
df$scaled_soil_moisture <- scale(df$soil_moisture)
model_b2 <- betareg(weed_coverage ~ scaled_soil_moisture * distance, data = df)
# Notice that I have let it back-transform the variable again.
tab_model(model_b2)
Please let me know if there is anything here that you do not understand.
EDIT: To clear up the question you had in the comment about the intercept, I plotted the two regression lines from the regression with the scaled independent variable. The intercept in this model corresponds to the value of weed_coverage
when the scaled_soil_moisture
equals zero and the distance is equal to 2 (because distance is a two-level factor, the default behaviour is usually to create a dummy variable with the first level being represented by the intercept). As you can see, on the red line when the x-axis is equal to 0, the y-axis is approximately 0.01 (it has been rounded down). This is in contrast the the non-scaled model, where the intercept is not pictured on the graph because there are no values of zero in the data. If you continued the red line past the data to where x=0, you would see that it crosses the y-axis around 0.02.
This page explains this is a fairly intuitive way.