
How to properly use the predict function in R


First I'm going to give you some starter code:

library(ggplot2)

y = c(0, 0, 1, 2, 0, 0, 1, 3, 0, 0, 3, 0, 6, 2, 8, 16, 21, 39, 48, 113, 92, 93, 127, 159, 137, 46, 238, 132, 124, 185, 171, 250, 250, 187, 119, 151, 292, 94, 281, 146, 163, 104, 156, 272, 273, 212, 210, 135, 187, 208, 310, 276, 235, 246, 190, 232, 254, 446,
      314, 402, 276, 279, 386, 402, 238, 581, 434, 159, 261, 356, 440, 498, 495, 462, 306, 233, 396, 331, 418, 293, 431, 300, 222, 222, 479, 501, 702,
      790, 681)
x = 1:length(y)

Now, I'm trying to predict what the 90th data point will be using polynomial regression. In the data, point #1 is 0 and point #89 is 681. I've tested my model and decided that an 8th-degree polynomial is the best fit.

I've tried the code predict(formula=y~poly(x,8),90) and it's giving some strange error (which doesn't make sense to me) about how there is no applicable method.

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "c('double', 'numeric')"

Why doesn't this work? After scouring countless pages of R documentation, blogs, and forums, it seemed to me that this should work properly.

What does work, instead? I've tried other ways of using the predict method, and I think that this is the closest solution to what I want: The predicted value for the 90th data point.

Any other suggestions? I'm not sure that my model is the best, and I would welcome any suggestions you may have. For example, you may argue that it's better to use a 6th degree than an 8th degree polynomial for modeling, and if you have a valid reason, I would agree with you.

Thank you!

NOTE: Please, PLEASE don't remove the thanks. I know some Stack Overflowers hate it, but I feel it gives a personal touch.


Solution

  • predict works on models. You have a formula, but not a model. You need to fit a model first, and then predict on that.

    Usually this is done in two steps, because usually people want to save the model so it can be used for more than just a single prediction - perhaps to examine coefficients, check assumptions, get model fit diagnostics, make a different prediction - without re-fitting the model.

    Here I'll use the simplest model that can take your formula, lm, which stands for "linear model". You could also use a GLM, or loess, or a random forest, a GAM, a neural net, or ... many many many different models.

    my_model = lm(formula=y~poly(x,8))
    predict(my_model, newdata = list(x = 90))
    #        1 
    # 977.9421 
    

    You could, of course, combine this into a single line, never bothering to save and name my_model:

    predict(lm(formula=y~poly(x,8)), newdata = list(x = 90))
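
    To see why the original call failed, and what the two-step pattern looks like with a data frame, here is a minimal sketch on made-up toy data. The names toy, toy_fit and toy_pred are invented for illustration, and y = x^2 is chosen only so the result is easy to check by hand.

    ```r
    # Toy data (made up for illustration): y is exactly x^2
    toy <- data.frame(x = 1:10)
    toy$y <- toy$x^2

    # predict() dispatches on the class of its first unnamed argument.
    # Here that argument is the number 11, so R looks for a predict
    # method for numeric vectors and finds none.
    err <- tryCatch(
      predict(formula = y ~ poly(x, 2), 11),
      error = function(e) e
    )
    class(11)                # "numeric"
    inherits(err, "error")   # TRUE: the call fails before any model is fitted

    # Fitting first gives predict() an object of class "lm" to dispatch on.
    toy_fit <- lm(y ~ poly(x, 2), data = toy)
    class(toy_fit)           # "lm"

    # newdata needs a column named like the predictor in the formula.
    toy_pred <- predict(toy_fit, newdata = data.frame(x = 11))
    unname(toy_pred)         # 121, up to floating-point error

    # The saved model can be reused for more predictions without re-fitting.
    toy_more <- predict(toy_fit, newdata = data.frame(x = 11:13))
    unname(toy_more)         # approximately 121, 144, 169
    ```

    Being able to reuse toy_fit like this is the main reason for saving the model instead of fitting it inline.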
    

    "I'm not sure that my model is the best, ..."

    It's not. Almost certainly. But that's okay - it's very hard to know that a model is best in any sense of the word.

    "... and I would welcome any suggestions you may have. For example, you may argue that it's better to use a 6th degree than an 8th degree polynomial for modeling, ..."

    I don't think I've ever seen an 8th degree polynomial used. (Or even 6th.) It's absurdly high. I have no idea what your data is, so I can't say much. If you have a reason to think that 8th degree polynomial is accurate, then go for it. But if you just want to fit a wiggly curve and extrapolate forward a tiny bit, then a cubic spline using mgcv::gam or a stats::loess model would be a much more standard choice.
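
    As a rough sketch of that comparison, the code below extrapolates to x = 90 with polynomials of a few degrees, a cubic regression spline from mgcv::gam, and a stats::loess fit. The names dat, new_point, degrees and the *_pred objects are my own, the choice of degrees and the default smoothing settings are assumptions rather than recommendations, and mgcv is assumed to be installed (it normally ships with R).

    ```r
    library(mgcv)  # recommended package, normally shipped with R

    # Data from the question
    y <- c(0, 0, 1, 2, 0, 0, 1, 3, 0, 0, 3, 0, 6, 2, 8, 16, 21, 39, 48, 113,
           92, 93, 127, 159, 137, 46, 238, 132, 124, 185, 171, 250, 250, 187,
           119, 151, 292, 94, 281, 146, 163, 104, 156, 272, 273, 212, 210, 135,
           187, 208, 310, 276, 235, 246, 190, 232, 254, 446, 314, 402, 276, 279,
           386, 402, 238, 581, 434, 159, 261, 356, 440, 498, 495, 462, 306, 233,
           396, 331, 418, 293, 431, 300, 222, 222, 479, 501, 702, 790, 681)
    dat <- data.frame(x = seq_along(y), y = y)
    new_point <- data.frame(x = 90)

    # Polynomial fits of several degrees, each extrapolated one step ahead.
    # The degree-8 entry is the same model as my_model above.
    degrees <- c(2, 3, 6, 8)
    poly_preds <- sapply(degrees, function(d) {
      fit <- lm(y ~ poly(x, d), data = dat)
      as.numeric(predict(fit, newdata = new_point))
    })
    names(poly_preds) <- paste0("degree_", degrees)

    # Penalized cubic regression spline; gam() estimates the amount of smoothing.
    gam_fit <- gam(y ~ s(x, bs = "cr"), data = dat)
    gam_pred <- as.numeric(predict(gam_fit, newdata = new_point))

    # loess only extrapolates when the surface is computed directly;
    # with the default (interpolation), a point outside the data range gives NA.
    loess_fit <- loess(y ~ x, data = dat,
                       control = loess.control(surface = "direct"))
    loess_pred <- as.numeric(predict(loess_fit, newdata = new_point))

    # Side-by-side view of the one-step-ahead predictions
    round(c(poly_preds, gam = gam_pred, loess = loess_pred), 1)
    ```

    If the polynomial predictions swing a lot as the degree changes while the two smoothers roughly agree with each other, that suggests the high-degree extrapolation is driven more by the polynomial's end behaviour than by the data.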