javaexcelapache-commons-math

Percentile calculation mismatch using apache.math3.stat.descriptive


I am calculating the 95th percentile of the following list of numbers:

66,337.8,989.7,1134.6,1118.7,1097.9,1122.1,1121.3,1106.7,871,325.2,285.1,264.1,295.8,342.4

The apache libraries use the NIST standards to calculate the percentile which is the same method used by Excel. According to Excel the 95th percentile of the list above should be 1125.85.

However, using the following code I get a different result:

DescriptiveStatistics shortList = new DescriptiveStatistics();



@BeforeTest
@Parameters("shortStatsList")
private void buildShortStatisticsList(String list) {
    StringTokenizer tokens = new StringTokenizer(list, ",");
    while (tokens.hasMoreTokens()) {
        shortList.addValue(Double.parseDouble(tokens.nextToken()));
    }
}

@Test
@Parameters("95thPercentileShortList")
public void percentileShortListTest(String percentile) {
    Assert.assertEquals(Double.toString(shortList.getPercentile(95)), percentile);
}

This fails with the following message:

java.lang.AssertionError: expected:<1125.85> but was:<1134.6>
at org.testng.Assert.fail(Assert.java:89)
at org.testng.Assert.failNotEquals(Assert.java:489)

1134.6 is the maximum value in the list, not the 95th percentile, so I don't know where this value is coming from.


Solution

  • According to the documentation of getPercentile() it is using the percentile estimation algorithm, as recorded here.

    Percentiles can be estimated from N measurements as follows: for the pth percentile, set p(N+1) equal to k+d for k an integer, and d, a fraction greater than or equal to 0 and less than 1.

    1. For 0<k<N,  Y(p)=Y[k]+d(Y[k+1]−Y[k])

    2. For k=0,  Y(p)=Y[1]

    Note that any p ≤ 1/(N+1) will simply be set to the minimum value.

    1. For k≥N,Y(p)=Y[N]

      Note that any p ≥ N/(N+1) will simply be set to the maximum value.

    Basically this means multiplying the requested percentile (0.95) by (N+1). In your case N is 15, and N+1 is 16, so you get 15.2.

    You split this into the whole part k (15), and d (0.2). The k falls into category 3 above. That is, the estimated percentile is the maximum value.


    If you keep on reading the NIST article that I linked above, you'll see the part titled "Note that there are other ways of calculating percentiles in common use". They refer you to an article by Hyndman & Fann, which describes several alternative ways of calculating percentiles. It's a misconception to thing that there is one NIST method. The methods in Hyndman & Fann are denoted by the labels R1 through R9. The article goes on to say:

    Some software packages set 1+p(N−1) equal to k+d and then proceed as above. This is method R7 of Hyndman and Fan. This is the method used by Excel and is the default method for R (the R quantile function can optionally use any of the nine methods discussed in Hyndman & Fan).

    The method used by default by Apache's DescriptiveStatistics is Hyndman & Fan's R6. The method used by Excel is R7. Both of them are "NIST methods", but for a small number of measurements, they can give different results.

    Note that the Apache library does allow you to use the R7 algorithm or any of the others, by using the Percentile class. Something like this should do the trick:

    DescriptiveStatistics shortList = new DescriptiveStatistics();
    shortList.setPercentileImpl( new Percentile().
                                     withEstimationType( Percentile.EstimationType.R_7 ) );
    

    (Note that I haven't tested this).