Tags: r, curve-fitting, exponential

Having trouble fitting an exponential decay model to these data in R:


  1. My problem: I have about 300 rows of data (more variables than the two attached, but the others are only relevant on my end) and need to predict xaq (y) from seed_prev_num (x). Their relationship looks like an exponential decay, but all of my attempts with nls() have produced alternating errors: "Missing value or an infinity produced" or "singular gradient matrix at initial parameter estimates".

Here is the graphical relationship between the two variables: [scatterplot of xaq against seed_prev_num]

  2. What I have tried: a log-transform, multiple starting functions, different nls() starting parameters (the max of y, and various values for b in the exponential function), spline regression, and splitting the data into "higher" and "lower" subsets to model each separately. The log-transform model performs OK but loses some power in the middle cases, which, unfortunately for our business rules, is where the bulk of our data will arrive.

  3. Here's the data via dput() to get you started (assigned to dd, the name used in the solution below); please let me know if you have questions!

dd <- structure(list(seed_prev_num = c(2671089.52136, 1723563.978132, 
1519478.513963, 10359757.487541, 435039.876186, 400860.993997, 
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572, 
802314.685373, 2589889.980437, 1578945.817656, 593092.510586, 
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565, 
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946, 
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899, 
2174606.683551, 694443.762395, 1297414.562631, 269677.307445), 
    xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 
    6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668, 
    3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783, 
    2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849, 
    2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714, 
    2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434, 
    5.726739927)), row.names = c(NA, -30L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

  • These all seem to work fine:

    log-linear fit

    m1 <- lm(log(xaq) ~ seed_prev_num, data = dd)
    

    This assumes that the errors are log-normal, i.e. that log(xaq) is Normally distributed with constant variance. I don't understand what you mean by "loses some power in the middle cases" ...
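
    If you need predictions back on the original scale from this model, a minimal sketch (pred1 is a hypothetical name; note that the naive back-transform exp(predict(m1)) estimates the median, not the mean, under log-normal errors):

    ## back-transform fitted values to the original scale;
    ## exp(sigma^2/2) is the standard log-normal bias correction for the mean
    pred1 <- exp(predict(m1)) * exp(sigma(m1)^2/2)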

    log-link Gaussian GLM

    This fits the model y ~ Normal(exp(a+b*x), sigma)

    m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), data = dd)
    

    nls with log-link GLM starting values

    This fits the same model.

    m3 <- nls(xaq ~ a*exp(b*seed_prev_num),
              data = dd,
              start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))
    

    These are both TRUE (since a*exp(b*x) = exp(log(a) + b*x), the GLM's intercept corresponds to log(a)):

    all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
    all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
    

    This trick is probably all you need. It also works if you use the starting values from the log-linear fit (start = list(a = exp(coef(m1)[1]), b = coef(m1)[2])), as sketched below.
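
    For concreteness, that variant would look like this (m3b is a hypothetical name, not part of the original code):

    ## same nls model, seeded from the log-linear fit instead of the GLM
    m3b <- nls(xaq ~ a*exp(b*seed_prev_num),
               data = dd,
               start = list(a = exp(coef(m1)[1]), b = coef(m1)[2]))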

    nls with arbitrary starting values but scaled x

    I suspect that the primary problem is the scaling of your x variable: with seed_prev_num running up to about 3.5e7, exp(b*seed_prev_num) overflows or underflows for all but tiny values of b, which is exactly what produces the errors you quoted ...

    m4 <- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
              data = dd,
              start = list(a = 1, b = 1))
    
    all.equal(coef(m4)[["b"]]/1e7, coef(m3)[["b"]], tolerance = 1e-4)
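
    For what it's worth, a call like the following (hypothetical starting values, not necessarily your exact attempt) reproduces the singular-gradient error with unscaled x:

    ## with x ~ 1e6-1e7 and b = -1, exp(b*x) underflows to exactly 0,
    ## so the gradient is identically zero and nls() reports
    ## "singular gradient matrix at initial parameter estimates";
    ## positive b instead overflows to Inf ("Missing value or an infinity produced")
    try(nls(xaq ~ a*exp(b*seed_prev_num),
            data = dd,
            start = list(a = max(dd$xaq), b = -1)))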
    

    The fits don't get much better: your management may have to live with the fact that these are noisy data.

    m6 <- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
              data = dd,
              start = list(a = 9, b = -2, c = 1.1))
    

    does slightly better (it increases the squared correlation between predicted and observed values, a.k.a. one form of R^2, to 0.34) but is not a better model according to AIC ...
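
    Those comparisons can be checked directly (a sketch, taking m3 as the two-parameter reference fit):

    ## squared correlation between predictions and observations (one form of R^2)
    cor(predict(m6), dd$xaq)^2
    ## AIC comparison (lower is better): the extra parameter isn't worth it
    AIC(m3, m6)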