I'm working with the qqnorm function in R to generate theoretical quantiles from a dataset of integers. I expected the quantiles calculated by qqnorm and those calculated manually with the usual quantile formula to be identical for any dataset size. However, I found that this is only true for n = 1 and n > 10. Here's the code I used:
# Define the function to compare quantiles
compare_quantiles <- function(n) {
  data <- 1:n
  qq_result <- qqnorm(data, plot.it = FALSE)
  theoretical_quantiles_qq_norm <- qq_result$x
  theoretical_quantiles_by_hand <- qnorm((1:length(data) - 0.5) / length(data))
  # Check if the quantiles are (nearly) identical
  identical_results <- all.equal(theoretical_quantiles_qq_norm, theoretical_quantiles_by_hand)
  return(identical_results)
}
# Check for n = 1 to n = 100
results <- sapply(1:100, compare_quantiles)
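To see which sample sizes agree (note that all.equal returns TRUE on agreement but a character description of the difference otherwise, so I coerce with isTRUE):
agree <- sapply(1:100, function(n) isTRUE(compare_quantiles(n)))
which(agree)  # 1 and 11:100, i.e. n = 1 and n > 10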
If you look at the code for qqnorm, you'll find the line
x <- qnorm(ppoints(n))[order(order(y))]
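The order(order(y)) part just rearranges the sorted quantiles back into the original order of the data (for distinct values it is equivalent to rank(y)); a quick illustration with made-up values:
y <- c(3, 1, 2)
order(y)         # 2 3 1: the indices that would sort y
order(order(y))  # 3 1 2: the rank of each element of y
# The smallest plotting position goes to y[2] and the largest to y[1],
# so the quantiles line up with the unsorted data
qnorm(ppoints(3))[order(order(y))]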
So rather than using (1:length(data) - 0.5) / length(data) as you do, qqnorm computes the plotting positions with the ppoints helper function, which is defined as
function (n, a = if (n <= 10) 3/8 else 1/2)
{
    if (length(n) > 1L)
        n <- length(n)
    if (n > 0)
        (1L:n - a) / (n + 1 - 2 * a)
    else numeric()
}
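You can check that the two conventions diverge below the cutoff; for example, with n = 5 the default a = 3/8 gives different plotting positions than the a = 1/2 implied by your formula:
n <- 5
ppoints(n)       # a = 3/8: 0.1190476 0.3095238 0.5000000 0.6904762 0.8809524
(1:n - 0.5) / n  # a = 1/2: 0.1 0.3 0.5 0.7 0.9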
If you bring up the helper function's documentation with ?ppoints, it points out:
The choice of a follows the documentation of the function of the same name in Becker et al (1988), and appears to have been motivated by results from Blom (1958) on approximations to expected normal order statistics (see also quantile).
So the function is deliberately designed to work differently for n <= 10, presumably because of the better statistical properties described in the referenced sources: an adjustment for small sample sizes. That also accounts for exactly the cases you observed: for n > 10, a = 1/2 turns ppoints into (1:n - 1/2)/(n + 1 - 1) = (1:n - 1/2)/n, which is your by-hand formula, and for n = 1 both expressions reduce to 0.5.
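If you want your by-hand calculation to agree with qqnorm for every n, the simplest fix is to call ppoints yourself instead of hard-coding a = 1/2; a minimal sketch of your function with that change:
compare_quantiles_fixed <- function(n) {
  data <- 1:n
  qq_result <- qqnorm(data, plot.it = FALSE)
  # Let ppoints() apply the same small-sample adjustment qqnorm uses
  theoretical_quantiles_by_hand <- qnorm(ppoints(length(data)))
  isTRUE(all.equal(qq_result$x, theoretical_quantiles_by_hand))
}
all(sapply(1:100, compare_quantiles_fixed))  # TRUE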