Tags: r · statistics · quantile · qqplot

Why do theoretical quantiles from qqnorm differ from manual calculation for small datasets?


I'm working with the qqnorm function in R to generate theoretical quantiles from a dataset of integers. I expected the quantiles calculated by qqnorm and those calculated manually with the quantile formula to be identical for any dataset size. However, I found that this holds only for n = 1 and n > 10. Here's the code I used:

# Define the function to compare quantiles
compare_quantiles <- function(n) {
  data <- 1:n
  qq_result <- qqnorm(data, plot = FALSE)
  theoretical_quantiles_qq_norm <- qq_result$x
  theoretical_quantiles_by_hand <- qnorm((1:length(data) - 0.5) / length(data))
  
  # Check if the quantiles are identical
  identical_results <- all.equal(theoretical_quantiles_qq_norm, theoretical_quantiles_by_hand)
  
  return(identical_results)
}
# Check for n = 1 to n = 100
results <- sapply(1:100, compare_quantiles)

Solution

  • If you look at the code for qqnorm, you'll find the line

    x <- qnorm(ppoints(n))[order(order(y))]
    

    So rather than using (1:length(data) - 0.5) / length(data) as you do, it uses this helper function. The ppoints function is defined as

    function (n, a = if (n <= 10) 3/8 else 1/2) 
    {
        if (length(n) > 1L) 
            n <- length(n)
        if (n > 0) 
            (1L:n - a)/(n + 1 - 2 * a)
        else numeric()
    }
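
You can see the two conventions diverge for small n by evaluating ppoints directly against the (i - 0.5)/n positions (a minimal check; n = 5 and n = 20 are chosen just for illustration):

```r
# For n <= 10, ppoints uses a = 3/8, which differs from (i - 0.5)/n
ppoints(5)              # (1:5 - 3/8) / (5 + 1 - 3/4)
(1:5 - 0.5) / 5         # the manual convention

# For n > 10, a = 1/2 and the two formulas coincide exactly:
# (1:n - 1/2) / (n + 1 - 1) == (1:n - 0.5) / n
all.equal(ppoints(20), (1:20 - 0.5) / 20)
```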
    

    If you bring up the helper function's documentation with ?ppoints, it points out:

    The choice of a follows the documentation of the function of the same name in Becker et al (1988), and appears to have been motivated by results from Blom (1958) on approximations to expected normal order statistics (see also quantile).

    So the function deliberately uses a = 3/8 for n ≤ 10 and a = 1/2 otherwise, presumably because the Blom positions better approximate expected normal order statistics for small samples, as described in the referenced sources. That is why your manual (i - 0.5)/n formula agrees with qqnorm only for n > 10 (where the two formulas are identical) and for n = 1 (where both reduce to 0.5).
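
It follows that swapping the hand-rolled formula for ppoints should make the comparison succeed for every n. A sketch using the same structure as the question's compare_quantiles (for sorted input like 1:n, the order(order(y)) permutation inside qqnorm is the identity, so qq_result$x equals qnorm(ppoints(n)) directly):

```r
compare_quantiles_fixed <- function(n) {
  data <- 1:n
  qq_result <- qqnorm(data, plot.it = FALSE)
  # Compare against the same plotting positions qqnorm uses internally
  isTRUE(all.equal(qq_result$x, qnorm(ppoints(n))))
}

# Now identical for every sample size
all(sapply(1:100, compare_quantiles_fixed))  # TRUE
```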