This is my data:
x = c(2.42, 2.59, 3.5, 2.75, 2.78, 3.58, 2.95, 2.06, 2.36, 2.48, 3.33, 2.89)
I got this qq plot from R:
qqnorm(x)
qqline(x)
Meanwhile, for the same data, I got the following qq plots with different settings from Python:
import pandas as pd
x = [2.42, 2.59, 3.5, 2.75, 2.78, 3.58, 2.95, 2.06, 2.36, 2.48, 3.33, 2.89]
df = pd.DataFrame(x, columns=['x'])
import statsmodels.api as sm
import matplotlib.pyplot as plt
sm.qqplot(df['x'], line='s')
plt.show()
sm.qqplot(df['x'], line='r')
plt.show()
sm.qqplot(df['x'], line='q')
plt.show()
You can see that the qq lines are all slightly different. Which qq plot/qq line should I rely on? I look forward to any information. Thanks.
tl;dr If you want R and Python to match, use a = 1/2 to adjust the x-axis offset slightly in your Python plots.
All of these plots are only slightly different. Do any of them lead you to different qualitative conclusions about what's going on with your data?
The closest thing to R's qqnorm()/qqline() is line = "q":
‘qqline’ adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the ‘probs’ quantiles, by default the first and third quartiles.
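To make the quoted description concrete, here is a minimal sketch of how a line through the first and third quartiles can be computed by hand. It uses only the Python standard library as a stand-in, not the actual statsmodels or R code; the "inclusive" method of statistics.quantiles interpolates linearly between order statistics, which I am assuming is equivalent to R's type = 7. The names slope and intercept are just illustrative.

```python
from statistics import NormalDist, quantiles

x = [2.42, 2.59, 3.5, 2.75, 2.78, 3.58, 2.95, 2.06, 2.36, 2.48, 3.33, 2.89]

# Sample quartiles; method="inclusive" interpolates linearly between
# order statistics (assumed here to correspond to R's type = 7).
q25, _, q75 = quantiles(x, n=4, method="inclusive")

# Matching quartiles of the standard normal distribution.
std_normal = NormalDist()
z25 = std_normal.inv_cdf(0.25)
z75 = std_normal.inv_cdf(0.75)

# Line through (z25, q25) and (z75, q75).
slope = (q75 - q25) / (z75 - z25)
intercept = q25 - slope * z25

print(f"sample quartiles: {q25:.3f}, {q75:.3f}")
print(f"reference line: y = {intercept:.3f} + {slope:.3f} * z")
```

Note that this line depends only on the sample and theoretical quartiles, not on where the individual points are placed along the x-axis.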
After quite a bit of digging I discovered that the quantile definitions used here by Python and R do match (qqline() uses type = 7 by default, which matches the definition of scipy.stats.scoreatpercentile, which is used internally by Python's qqplot). What differs is the location of the x-axis plotting points: Python uses an internal function defined as
(np.arange(1.,nobs+1) - a)/(nobs- 2*a + 1)
with an offset parameter a set to 0 by default, whereas R uses the ppoints() function, which is defined similarly but with a different default a value: if(n <= 10) 3/8 else 1/2 (see ?ppoints). Your data has n = 12 observations, so R uses a = 1/2 here. Thus, you can get Python to match R's default if you use
sm.qqplot(df['x'], line='q', a=1/2)
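If you want to see how much the offset actually moves the points, the following sketch evaluates the plotting-position formula quoted above for a = 0 and a = 1/2 with n = 12. It is a plain-Python re-implementation for illustration only; plotting_positions is a hypothetical helper name, not a statsmodels or R function.

```python
from statistics import NormalDist

def plotting_positions(n, a):
    """Probabilities (i - a) / (n - 2*a + 1) for i = 1..n."""
    return [(i - a) / (n - 2 * a + 1) for i in range(1, n + 1)]

n = 12
p_default = plotting_positions(n, a=0)    # offset that Python uses by default
p_r_style = plotting_positions(n, a=0.5)  # offset R's ppoints() picks for n > 10

# Convert the probabilities to theoretical normal quantiles (the x-axis values).
std_normal = NormalDist()
z_default = [std_normal.inv_cdf(p) for p in p_default]
z_r_style = [std_normal.inv_cdf(p) for p in p_r_style]

# Print the two sets of x-axis positions side by side.
for zd, zr in zip(z_default, z_r_style):
    print(f"a=0: {zd:+.3f}   a=1/2: {zr:+.3f}")
```

With a = 1/2 the extreme points sit a little further out along the x-axis than with a = 0, which is why the same sample values appear shifted relative to the reference line.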