Could someone please explain how percentiles are calculated by the describe() method?
Different sources explain this using different approaches. What is the exact method of calculation?
For example, consider the following code:
import pandas as pd
l = [10, 13, 15, 19, 21, 25]
s = pd.Series(l)
s.describe()
Output is:
count 6.000000
mean 17.166667
std 5.528713
min 10.000000
**25% 13.500000**
50% 17.000000
75% 20.500000
max 25.000000
Could someone please explain how 25% (Q1) is calculated?
The quantile() method has an interpolation parameter that controls how a quantile is computed when the requested position falls between two data points. For describe(), the default appears to be linear, as seen here:
s.quantile([.25, .5, .75], interpolation="linear")
which gives:
0.25 13.5
0.50 17.0
0.75 20.5
You can also use:
s.quantile([.25, .5, .75], interpolation="nearest")
to get:
0.25 13
0.50 15
0.75 21
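To see where these values come from, here is a rough pure-Python sketch. It assumes the position of the quantile in the sorted data is (n - 1) * q and that a tie such as 2.5 is rounded to the even index, which is consistent with 0.50 giving 15 above. The helper name nearest_quantile is hypothetical and is not part of pandas.

```python
def nearest_quantile(values, q):
    """Pick the data point whose index is closest to the fractional position."""
    data = sorted(values)
    # Assumed convention: fractional index into the sorted data.
    pos = (len(data) - 1) * q
    # Python's round() sends ties to the even integer (2.5 -> 2),
    # which matches the 0.50 -> 15 result shown above.
    return data[round(pos)]

l = [10, 13, 15, 19, 21, 25]
for q in (0.25, 0.5, 0.75):
    print(q, nearest_quantile(l, q))
# 0.25 13
# 0.5 15
# 0.75 21
```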
"nearest" was the one I was expecting too. According to the documentation, the formula for linear is:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
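Applied to Q1 in the question: with 6 values, the 25% position is (6 - 1) * 0.25 = 1.25, so i = 13 (index 1), j = 15 (index 2) and fraction = 0.25, giving 13 + (15 - 13) * 0.25 = 13.5. The sketch below reproduces this by hand. It assumes the position is computed as (n - 1) * q on the sorted data, which agrees with the describe() output above. The helper name linear_quantile is hypothetical and is not part of pandas.

```python
import math

def linear_quantile(values, q):
    """Quantile via linear interpolation between the two neighbouring data points."""
    data = sorted(values)
    # Assumed convention: fractional index into the sorted data.
    pos = (len(data) - 1) * q
    lower = math.floor(pos)
    # Clamp so that q = 1 does not index past the end.
    upper = min(lower + 1, len(data) - 1)
    fraction = pos - lower
    i, j = data[lower], data[upper]
    return i + (j - i) * fraction

l = [10, 13, 15, 19, 21, 25]
for q in (0.25, 0.5, 0.75):
    print(q, linear_quantile(l, q))
# 0.25 13.5
# 0.5 17.0
# 0.75 20.5
```

These match the 25%, 50% and 75% rows reported by describe().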