I'm trying to understand why when I plot marginal effects using different packages I get different plots. Admittedly, I suspect it's because I do not understand how data transformations and offsets are being handled by the respective packages. I'm afraid I don't know how to make a reproducible sample of data, but I hope I can provide enough context below.
I have the following model using package MASS
:
model <- glm(Lethal_removal ~ Federal_land + state / pred_rate + offset(log(subtotal)), data=df3, family=negative.binomial(1), na.action=na.fail)
The unit of analysis is state counties. State
is a factor with 4 US states, pred_rate
is an annual measure of how many livestock are eaten per county by wild predators (mean of 3.5 per year), lethal_removal
is the number of wild predators that are culled per county by wildlife agencies (mean of 1.6 per year). Offset
is number of years (subtotal
) that predators are present in each county. I want to plot the marginal effects of pred_rate
on lethal_removal
by state
while accounting for the subtotal
offset (i.e., convert to an annual rate).
I used package emmeans
first:
plot_em<-emmip(model, state ~ pred_rate, cov.reduce = range, offset = log(1), CIs = TRUE, plotit=FALSE)
#and plot it with the observed data
ggplot(as.data.frame(plot_em), aes(x=pred_rate, y=yvar, color=state)) +
geom_line() + geom_ribbon(aes(ymax=UCL, ymin=LCL, fill=state), alpha=0.4) +
geom_point(data = df3, aes(y=Lethal_removal/subtotal), size = 3)
I then try ggeffects
which gives me:
pr <- predict_response(model, c("pred_rate", "state"), condition = c(subtotal = 1))
plot(pr, show_data = TRUE)
Which suggests it was on an exponential scale, so I log-transformed the y-axis plot(pr, show_data = TRUE, log_y = TRUE)
which gave me this:
Which is very similar to the results from emmeans
but the scale on the y-axis is way off. Frankly, the plot from emmeans
makes intuitive sense, but my inability to easily recreate it leaves me concerned that I'm missing something and I'm not sure what is the correct marginal effects plot. Any advice huge appreciated.
Below is dput of the data.frame df3
structure(list(pred_rate = c(0.72, 11.6092307692308, 0.669090909090909,
2.67636363636364, 18.9753846153846, 0, 0, 5.50153846153846, 7.6,
0.0707692307692308, 5.2, 0.283076923076923, 0.306666666666667,
2.47666666666667, 4.03076923076923, 9.83692307692308, 2.88727272727273,
0, 0, 4.28923076923077, 1.28, 0.752727272727273, 17.7569230769231,
0.849230769230769, 0, 0, 0, 6.65230769230769, 0.153333333333333,
6.532, 2.15692307692308, 4.06153846153846, 0.750769230769231,
0.167272727272727, 4.27692307692308, 16.7309090909091, 0, 1.37230769230769,
0.345, 8.6, 2.06666666666667, 6.76, 1.6, 0, 0.353846153846154,
0.501818181818182, 3.37333333333333, 0.0307692307692308, 3.36307692307692,
0.131428571428571, 0, 0.153333333333333, 13.8615384615385, 10.9692307692308,
2.52, 0.265, 19.3476923076923, 0.44, 0.141538461538462, 1.51384615384615,
0.184, 0.283076923076923, 4.48, 1.4, 0.306666666666667, 7.65846153846154,
10.3507692307692, 2.32, 2.92923076923077, 0, 0.849230769230769,
0, 0.6, 7.02181818181818, 1.49538461538462, 1.37538461538462,
0.683076923076923, 4.4, 2.65090909090909, 1.244, 26.9507692307692,
0, 5.24307692307692, 0, 3.10769230769231, 0, 0.581538461538462,
0.4), state = c("IDAHO", "IDAHO", "WASHINGTON", "OREGON", "MONTANA",
"IDAHO", "MONTANA", "IDAHO", "IDAHO", "IDAHO", "IDAHO", "IDAHO",
"MONTANA", "IDAHO", "IDAHO", "MONTANA", "MONTANA", "WASHINGTON",
"OREGON", "IDAHO", "IDAHO", "WASHINGTON", "IDAHO", "MONTANA",
"OREGON", "OREGON", "WASHINGTON", "IDAHO", "MONTANA", "WASHINGTON",
"MONTANA", "IDAHO", "MONTANA", "WASHINGTON", "IDAHO", "MONTANA",
"MONTANA", "MONTANA", "OREGON", "IDAHO", "OREGON", "IDAHO", "MONTANA",
"OREGON", "MONTANA", "WASHINGTON", "OREGON", "IDAHO", "MONTANA",
"OREGON", "OREGON", "IDAHO", "IDAHO", "MONTANA", "MONTANA", "IDAHO",
"MONTANA", "MONTANA", "MONTANA", "MONTANA", "OREGON", "WASHINGTON",
"MONTANA", "IDAHO", "WASHINGTON", "MONTANA", "MONTANA", "MONTANA",
"MONTANA", "IDAHO", "MONTANA", "WASHINGTON", "WASHINGTON", "WASHINGTON",
"MONTANA", "MONTANA", "IDAHO", "MONTANA", "OREGON", "OREGON",
"IDAHO", "WASHINGTON", "OREGON", "OREGON", "IDAHO", "WASHINGTON",
"MONTANA", "WASHINGTON"), Federal_land = c(42.534151492934, 63.9784662986647,
16.6003160782432, 51.1028955500689, 58.947487328762, 9.06180914439123,
0.911130670014708, 76.9081756955196, 73.1858648432361, 39.5200226761674,
48.8486328184087, 60.7062703076754, 35.0609330068159, 63.2475807326415,
64.3067163139122, 43.5861065699575, 12.4708493770657, 79.1564164555182,
51.2009118836618, 61.6880505872881, 50.5134498548557, 28.5596173825025,
92.9698127308114, 45.4524774304155, 75.2608706829412, 50.0736659471988,
4.57522270679354, 69.066464511745, 17.0816340292832, 33.9771447677684,
72.3480726901471, 58.9285919495093, 43.6772079658238, 20.7664717977071,
36.9601044858487, 20.7302226596209, 4.31711620704182, 63.9399492882534,
61.074079782981, 83.423569139067, 51.016535147847, 29.5419594948999,
52.2630701694041, 27.6580391015359, 25.9703952858087, 34.142779076987,
4.0917069e-07, 30.2891363697992, 17.0896884679476, 73.1092996462389,
56.4327951729983, 15.8206520127824, 90.3193622408296, 48.5392976606421,
74.4951345121614, 19.7985864424376, 46.6355884021095, 31.9051614956376,
81.6803761570734, 54.5946431095638, 11.4225705007466, 46.0670532397187,
53.2256541508368, 24.6637518111356, 58.0002021332745, 10.412284283733,
50.4107537544319, 73.7125077034306, 51.7314184202192, 74.9360369140056,
51.1361611998182, 40.779585438193, 1.87434204580111, 19.5796225789176,
17.6532173966671, 24.695272565077, 33.809439417054, 17.4984242308043,
20.8413651931395, 47.821008360042, 86.0044585359916, 2.10073799490256,
58.2991711104475, 16.7330858857592, 36.7384554719021, 54.4145335580903,
7.26384475698267, 0.659296220775652), Lethal_removal = c(3L,
67L, 1L, 12L, 195L, 0L, 0L, 43L, 67L, 1L, 18L, 0L, 1L, 11L, 35L,
8L, 3L, 0L, 0L, 16L, 116L, 2L, 100L, 5L, 0L, 0L, 0L, 61L, 0L,
19L, 22L, 19L, 0L, 0L, 10L, 61L, 0L, 15L, 0L, 94L, 0L, 5L, 23L,
0L, 1L, 0L, 0L, 4L, 57L, 0L, 0L, 4L, 84L, 136L, 43L, 9L, 99L,
0L, 8L, 16L, 0L, 0L, 9L, 0L, 0L, 9L, 55L, 30L, 51L, 2L, 8L, 0L,
0L, 14L, 5L, 0L, 2L, 7L, 2L, 0L, 86L, 0L, 11L, 0L, 10L, 0L, 2L,
0L), subtotal = c(10L, 13L, 11L, 11L, 13L, 13L, 13L, 13L, 13L,
13L, 11L, 13L, 12L, 12L, 13L, 13L, 11L, 10L, 3L, 13L, 13L, 11L,
13L, 13L, 1L, 4L, 1L, 13L, 12L, 10L, 13L, 13L, 13L, 11L, 13L,
11L, 12L, 13L, 8L, 13L, 9L, 2L, 13L, 2L, 13L, 11L, 9L, 13L, 13L,
7L, 4L, 12L, 13L, 13L, 13L, 8L, 13L, 13L, 13L, 13L, 5L, 13L,
13L, 1L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 5L, 1L, 11L, 13L,
13L, 13L, 13L, 11L, 10L, 13L, 7L, 13L, 5L, 13L, 4L, 13L, 1L)), row.names = c(NA,
-88L), class = "data.frame")
emmip
by default returns predictions on the link scale - predict_response()
always returns predictions on the response scale. That's why the axis limits are so different. If you use type = "response"
, you get the same, large axis range:
emmeans::emmip(
model,
state ~ pred_rate,
type = "response",
cov.reduce = range,
offset = log(1),
CIs = TRUE,
plotit = FALSE
)
So, the y-axis of the plot from predict_response()
indicates average predicted counts / incidences, the y-axis from emmip()
is by default the log of average predicted counts.
Some closer look at different ranges / subsets of the data. As you can see, it seems there are not many data points and probably some outliers for Oregon, which a) lead to spuriously high predicted values and b) large confidence intervals.
library(MASS)
library(ggeffects)
model <- glm(Lethal_removal ~ Federal_land + state / pred_rate + offset(log(subtotal)), data = df3, family = negative.binomial(1), na.action = na.fail)
# only three states, as Oregon has extreme values - furthermore, limit
# the range of pred_rate to 0:7 - higher values have larger confidence intervals
pr <- predict_response(model, terms = c("pred_rate [0:7]", "state [IDAHO, WASHINGTON, MONTANA]"))
plot(pr, show_ci = FALSE)
# include all states
pr <- predict_response(model, terms = c("pred_rate [0:7]", "state"))
plot(pr, show_ci = FALSE)
# now add CI
pr <- predict_response(model, terms = c("pred_rate [0:7]", "state [IDAHO, WASHINGTON, MONTANA]"))
plot(pr, show_ci = TRUE)
When you check model assumptions (e.g. using performance::check_model()
, you see that you still seem to have some over dispersion, and that the data has more zeros than expected by your model. I assume your plots are not only a matter of log-scaled axis or not, but also about a) finding the right model and b) checking if you need to select a subset of your data so you have enough data values for prediction (or: limit the range of predictions for pred_rate
).