r ggplot2 rstudio geom

Negative values in linear trend and Confidence interval in R


First, the code I used:

ggplot(correlation, aes(x=area_ha, y=extent_2000_ha)) +
  geom_point(color="green") + theme_ipsum() +
  theme(text=element_text(family="Times New Roman", size=14)) +
  scale_y_continuous(labels=scales::comma) +
  geom_smooth(method=lm, color="red", se=FALSE)

When I add the linear trend and the confidence interval (see the attached graphs), negative values (around -200,000) appear on the y-axis. All of the values in my data are positive; there are no negative values.



Solution

  • If it only makes sense for your regression line to be strictly positive, then a standard linear regression is not the right model for your data. A linear regression simply finds the line that minimizes the sum of squared vertical distances between your data points and the line. It does not care whether this makes the line (or its confidence band) negative where you think it shouldn't be. Positivity is an extra constraint that you need to build into your model, and how to do so depends on the context and physical interpretation of your data (which we can only guess at from the information in your question).

    For example, you could consider a linear regression with a fixed intercept of 0:

    ggplot(correlation, aes(x = area_ha, y = extent_2000_ha)) +
      geom_point( color = "green4", alpha = 0.2) + 
      theme_ipsum() + 
      theme(text=element_text(family="Times New Roman", size=14)) + 
      scale_y_continuous(labels = scales::comma) +
      scale_x_continuous(labels = scales::comma, limits = c(0, 8e5)) +
      geom_smooth(method = "lm", color = "red3", formula = y ~ x + 0,
                  fullrange = TRUE, alpha = 0.2) 
    

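
    The effect of the fixed-intercept formula can also be checked outside ggplot2 with plain lm(). The following is only a minimal sketch on made-up, strictly positive values (x and y are hypothetical, not the asker's data). It compares the default fit with the y ~ x + 0 fit used in geom_smooth() above.

    ```r
    # Made-up, strictly positive data with an upward curve (illustrative only)
    x <- 1:10
    y <- x^2

    # Default straight line: the intercept is estimated freely
    fit_default <- lm(y ~ x)

    # Same model forced through the origin, as in formula = y ~ x + 0
    fit_origin <- lm(y ~ x + 0)

    # Evaluate both fitted lines at x = 0
    at_zero <- data.frame(x = 0)
    pred_default <- unname(predict(fit_default, newdata = at_zero))
    pred_origin  <- unname(predict(fit_origin,  newdata = at_zero))

    print(coef(fit_default))  # the intercept is below zero although every y is positive
    print(pred_origin)        # the origin-constrained line passes through 0
    ```

    Forcing the intercept to 0 only makes sense if a zero x should really imply a zero y, which is an assumption about the data rather than something the code can check.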

    Or perhaps a generalised linear model with a log-link function:

    ggplot(correlation, aes(x = area_ha, y = extent_2000_ha)) +
      geom_point( color = "green4", alpha = 0.2) + 
      theme_ipsum() + 
      theme(text=element_text(family="Times New Roman", size=14)) + 
      scale_y_continuous(labels = scales::comma) +
      scale_x_continuous(labels = scales::comma, limits = c(0, 8e5)) +
      geom_smooth(method = "glm", color = "red", method.args = list(family = poisson),
                  fullrange = TRUE) 
    

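
    A similar check works for the log-link model. With family = poisson, glm() models the logarithm of the expected value, so the back-transformed fitted curve stays above zero, even when extrapolated. A minimal sketch on made-up integer data (again hypothetical, not the asker's data):

    ```r
    # Made-up positive integer responses (illustrative only)
    x <- 1:10
    y <- x^2

    # The Poisson family uses a log link by default
    fit_glm <- glm(y ~ x, family = poisson)

    # Predict on the response scale, including x values outside the data range
    new_x <- data.frame(x = c(-5, 0, 5, 20))
    pred_glm <- predict(fit_glm, newdata = new_x, type = "response")

    print(pred_glm)  # every prediction is positive because the inverse link is exp()
    ```

    Note that glm() warns about non-integer responses under the Poisson family. The simulated data further down is rounded to whole numbers, so it does not trigger that warning.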

    As for which model is best, that is a statistics question rather than a programming question. It is therefore off-topic here, but may be on-topic at Cross Validated.


    Data used

    No data was included in the question, so I generated a similar dataset with the following code to produce the examples above:

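    library(ggplot2)
    library(hrbrthemes)  # provides theme_ipsum()
    
    set.seed(2)
    # Areas drawn uniformly between 1 and 800,000
    correlation <- data.frame(area_ha = runif(100, 1, 8e5))
    # Squared noisy response, so every value is non-negative
    correlation$extent_2000_ha <- (correlation$area_ha + 
                                     rnorm(100, 0, correlation$area_ha))^2/9e6
    # Keep only rows with an extent between 1e4 and 1e6
    correlation <- correlation[correlation$extent_2000_ha > 1e4,]
    correlation <- correlation[correlation$extent_2000_ha < 1e6,]
    # Round both columns to whole numbers
    correlation <- round(correlation)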