
Why does Pandas rolling std output different results for different tail sizes?


The output of pandas' rolling std is not consistent when the data size changes. I applied rolling std to different slices of the data, all taken from the tail of the series, and observed that the result for s.tail(3) differs from that of s.tail(4) and s.tail(5):

>>> s = pd.Series(np.random.default_rng(seed=123).random(size=5))
>>> s[1] = 10000000  # just a large number
>>> s
0    6.823519e-01
1    1.000000e+07
2    2.203599e-01
3    1.843718e-01
4    1.759059e-01
dtype: float64
>>> s.tail(3).rolling(window=3, min_periods=1).std()
2         NaN
3    0.025447
4    0.023604
dtype: float64
>>> s.tail(4).rolling(window=3, min_periods=1).std()
1             NaN
2    7.071068e+06
3    5.773503e+06
4    0.000000e+00
dtype: float64
>>> s.tail(5).rolling(window=3, min_periods=1).std()
0             NaN
1    7.071067e+06
2    5.773502e+06
3    5.773503e+06
4    0.000000e+00
dtype: float64

In comparison, if I apply pd.Series.std to the rolling windows, the last value is consistent across all three slices:

>>> s.tail(3).rolling(window=3, min_periods=1).apply(pd.Series.std)
2         NaN
3    0.025447
4    0.023604
dtype: float64
>>> s.tail(4).rolling(window=3, min_periods=1).apply(pd.Series.std)
1             NaN
2    7.071068e+06
3    5.773503e+06
4    2.360426e-02
dtype: float64
>>> s.tail(5).rolling(window=3, min_periods=1).apply(pd.Series.std)
0             NaN
1    7.071067e+06
2    5.773502e+06
3    5.773503e+06
4    2.360426e-02
dtype: float64

Why does pandas' rolling standard deviation, computed with the same window size of 3, produce different results?

Pandas version: 2.2.2 (also tried 2.3.0) on Windows


Solution

  • Each rolling window ends at the current row and looks backward, so the first rows are computed from fewer than `window` values.

    This means for a Series of 5 items a, b, c, d, e and a window of 3, the computation will be:

    std(a)          # 0         NaN
    std(a, b)       # 1    0.444438
    std(a, b, c)    # 2    0.325633
    std(b, c, d)    # 3    0.087630
    std(c, d, e)    # 4    0.023604
    

    If you first select items with tail(3), you'll have:

    std(c)          # 2         NaN   #
    std(c, d)       # 3    0.025447   # this is different
    std(c, d, e)    # 4    0.023604
    

    And std(c, d) != std(b, c, d).

    In short, the first n-1 items of your output will be affected for a window size of n.
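
    One way to check this is to compare the full-series result with the tail(3) result row by row. The sketch below uses the original random series (before s[1] is overwritten); the names full, tail3 and comparison are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(seed=123).random(size=5))

full = s.rolling(window=3, min_periods=1).std()
tail3 = s.tail(3).rolling(window=3, min_periods=1).std()

# Align both results on the index, keep rows where both have a value,
# and mark the rows where the two computations agree.
comparison = pd.DataFrame({"full": full, "tail3": tail3}).dropna()
comparison["same"] = np.isclose(comparison["full"], comparison["tail3"])
print(comparison)
```

    Only the last row is computed from the same complete window (c, d, e) in both cases, so it is the only row expected to match.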

    Edited example

    When you change the second value to 10000000, the intermediate quantities in the computation become so large that float64 can no longer represent the contribution of the small values alongside them, and that contribution is rounded away to zero. This happens because the deviations from the mean are squared.
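
    The scale of the problem can be estimated without pandas. The sketch below uses the rounded values printed in the question and a helper, sum_sq_dev, introduced here only for illustration. It compares the sum of squared deviations of a window that contains the large value with the spacing between adjacent float64 numbers at that magnitude.

```python
import math

big_window = [1e7, 0.2203599, 0.1843718]          # window containing the large value
small_window = [0.2203599, 0.1843718, 0.1759059]  # last window, small values only

def sum_sq_dev(xs):
    """Sum of squared deviations from the mean, computed directly."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

big = sum_sq_dev(big_window)      # on the order of 1e13
small = sum_sq_dev(small_window)  # on the order of 1e-3

# math.ulp (Python 3.9+) returns the gap between adjacent float64 values near `big`.
print(big, math.ulp(big), small)
print(small < math.ulp(big))
```

    If the quantity that should remain after the large value leaves the window is smaller than that gap, a running total of this magnitude has no way to hold it.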

    You can observe that there is no issue if s[1] = 1000000 (one less zero):

    s = pd.Series(np.random.default_rng(seed=123).random(size=5))
    s[1] = 1000000
    s.rolling(window=3, min_periods=1).std()
    
    0              NaN
    1    707106.298691
    2    577350.008599
    3    577350.152354
    4         0.021852
    dtype: float64
    

    The difference is more obvious if you compute the variance:

    s.rolling(window=3, min_periods=1).var()
    
    0             NaN
    1    4.999993e+11
    2    3.333330e+11
    3    3.333332e+11
    4    4.775168e-04  # difference of 15 powers of ten
    dtype: float64
    

    According to the source code, pandas computes the rolling variance (std^2) online, and the sum of squared deviations from the mean (ssqdm_x) is one of the variables maintained by this running variance algorithm. When the rolling window moves past the large value in the example, the remove_var function subtracts a large number, (val - prev_mean) * (val - mean_x[0]), from ssqdm_x. However, since ssqdm_x is already a large, imprecise number at that point, the subtraction cannot restore the correct small value. Sometimes the value of ssqdm_x after the subtraction is still huge, which makes the contribution of the later, smaller values in the series negligible and results in a variance of nearly zero.
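
    The mechanism described above can be reproduced with a simplified add/remove running variance. This is a minimal sketch, not pandas' actual code: the function name rolling_var_online is made up for illustration, it assumes window >= 2, uses ddof=1, and omits the error compensation that a production implementation may apply.

```python
import math
import statistics

def rolling_var_online(values, window):
    """Simplified running variance (ddof=1); illustrative, not pandas' code."""
    nobs, mean, ssqdm = 0, 0.0, 0.0
    out = []
    for i, val in enumerate(values):
        # Add the newest value (Welford-style update).
        nobs += 1
        delta = val - mean
        mean += delta / nobs
        ssqdm += delta * (val - mean)
        # Remove the value that just left the window (reverse update).
        if i >= window:
            old = values[i - window]
            nobs -= 1
            delta = old - mean
            mean -= delta / nobs
            ssqdm -= delta * (old - mean)
        # Clamp tiny negative totals caused by rounding.
        out.append(max(ssqdm, 0.0) / (nobs - 1) if nobs > 1 else math.nan)
    return out

# Rounded values as printed in the question.
s = [0.6823519, 1e7, 0.2203599, 0.1843718, 0.1759059]
online = rolling_var_online(s, window=3)
direct = statistics.variance(s[-3:])  # recomputed from the last window only
print(online[-1], direct)
```

    Recomputing each window from scratch, as rolling(...).apply(pd.Series.std) does in the question, avoids carrying the rounding error from one window to the next, at the cost of more work per row.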