python pandas standard-deviation pandas-rolling

Why does Pandas rolling std output different results for different tail sizes?

The output of pandas' rolling std is not consistent when the data size changes. I applied rolling std to different slices of the data, all taken from the tail of the series, and observed that the result for s.tail(3) differs from that of s.tail(4) and s.tail(5):

>>> s = pd.Series(np.random.default_rng(seed=123).random(size=5))
>>> s[1] = 10000000  # just a large number
>>> s
0    6.823519e-01
1    1.000000e+07
2    2.203599e-01
3    1.843718e-01
4    1.759059e-01
dtype: float64
>>> s.tail(3).rolling(window=3, min_periods=1).std()
2         NaN
3    0.025447
4    0.023604
dtype: float64
>>> s.tail(4).rolling(window=3, min_periods=1).std()
1             NaN
2    7.071068e+06
3    5.773503e+06
4    0.000000e+00
dtype: float64
>>> s.tail(5).rolling(window=3, min_periods=1).std()
0             NaN
1    7.071067e+06
2    5.773502e+06
3    5.773503e+06
4    0.000000e+00
dtype: float64

In comparison, if I apply pd.Series.std to the rolling windows, the last result would be consistent:

>>> s.tail(3).rolling(window=3, min_periods=1).apply(pd.Series.std)
2         NaN
3    0.025447
4    0.023604
dtype: float64
>>> s.tail(4).rolling(window=3, min_periods=1).apply(pd.Series.std)
1             NaN
2    7.071068e+06
3    5.773503e+06
4    2.360426e-02
dtype: float64
>>> s.tail(5).rolling(window=3, min_periods=1).apply(pd.Series.std)
0             NaN
1    7.071067e+06
2    5.773502e+06
3    5.773503e+06
4    2.360426e-02
dtype: float64

Why does pandas' rolling standard deviation, computed on the same window of size 3, produce different results?

Pandas version: 2.2.2 (also tried 2.3.0) on Windows

Solution

The rolling window is right-aligned: each window ends at the current row and looks back over the preceding items.

This means for a Series of 5 items a, b, c, d, e and a window of 3, the computation will be:

std(a)          # 0         NaN
std(a, b)       # 1    0.444438
std(a, b, c)    # 2    0.325633
std(b, c, d)    # 3    0.087630
std(c, d, e)    # 4    0.023604

If you first select items with tail(3), you'll have:

std(c)          # 2         NaN   #
std(c, d)       # 3    0.025447   # this is different
std(c, d, e)    # 4    0.023604

And std(c, d) != std(b, c, d).

In short, the first n-1 items of your output will be affected for a window size of n.
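As a quick check of this, here is a minimal sketch that compares the full Series against its tail(3) on the unmodified random data (no large value inserted). The names full and tail are just illustrative.

```python
import numpy as np
import pandas as pd

# Same random data as in the question, without the large value.
s = pd.Series(np.random.default_rng(seed=123).random(size=5))

full = s.rolling(window=3, min_periods=1).std()
tail = s.tail(3).rolling(window=3, min_periods=1).std()

# Index 4 sees the complete window (c, d, e) in both cases, so it agrees.
# Index 3 sees (b, c, d) in `full` but only (c, d) in `tail`, so it differs.
print(pd.DataFrame({"full": full, "tail(3)": tail}))
```

Only the rows whose window would reach back before the start of the slice change.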

Edited example

When you change the second value to 10000000, the intermediate values of the computation can no longer represent both the large and the small numbers faithfully as floating point numbers (float64), and the contribution of the small numbers is lost. This is due to taking the square of the differences to the mean.
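To get a feel for the magnitudes involved, here is a small sketch. The numbers big and small are rough estimates I picked for illustration: about the sum of squared deviations of a 3-item window containing 10000000, and about the sum of squared deviations of the two small values that follow it.

```python
import numpy as np

big = 6.67e13    # rough size of the squared deviations with 1e7 in the window
small = 6.5e-4   # rough size of the squared deviations of the small values

# Distance from `big` to the next representable float64.
gap = np.spacing(big)
print(gap)

# `small` is below half of that gap, so adding it to `big` changes nothing.
absorbed = (big + small) - big
print(absorbed)
```

Anything smaller than about half the gap simply disappears when it is added to a number of that size.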

You can observe that the last value no longer collapses to zero if s[1] = 1000000 (one fewer zero):

s = pd.Series(np.random.default_rng(seed=123).random(size=5))
s[1] = 1000000
s.rolling(window=3, min_periods=1).std()

0              NaN
1    707106.298691
2    577350.008599
3    577350.152354
4         0.021852
dtype: float64

The difference is more obvious if you compute the variance:

s.rolling(window=3, min_periods=1).var()

0             NaN
1    4.999993e+11
2    3.333330e+11
3    3.333332e+11
4    4.775168e-04  # difference of 15 powers of ten
dtype: float64

According to the source code, pandas calculates the rolling variance (std^2) online, and the sum of squared deviations from the mean (ssqdm_x) is one of the variables maintained by this running variance algorithm. When the rolling window moves past the large value in the example, the remove_var function subtracts a large number, (val - prev_mean) * (val - mean_x[0]), from ssqdm_x. However, since ssqdm_x is by then a large and already imprecise number, the subtraction does not restore the correct small value. The value of ssqdm_x after the subtraction can also still be huge relative to the later, smaller values in the series, which makes their contribution negligible and results in a nearly zero variance.
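The effect can be reproduced outside pandas with a simplified pure-Python sketch of such an add/remove update. This is not the pandas implementation (which is written in Cython and differs in details); the function name online_rolling_var is hypothetical, and only the ssqdm update formula is taken from the description above.

```python
import numpy as np

def online_rolling_var(values, window):
    """Rolling sample variance (ddof=1) from running mean/ssqdm updates.

    Simplified illustration only, not the pandas code.
    """
    nobs, mean, ssqdm = 0, 0.0, 0.0
    out = []
    for i, val in enumerate(values):
        if i >= window:
            # Remove the value that leaves the window.
            old = values[i - window]
            nobs -= 1
            if nobs:
                prev_mean = mean
                mean -= (old - prev_mean) / nobs
                ssqdm -= (old - prev_mean) * (old - mean)
            else:
                mean, ssqdm = 0.0, 0.0
        # Add the value that enters the window.
        nobs += 1
        prev_mean = mean
        mean += (val - prev_mean) / nobs
        ssqdm += (val - prev_mean) * (val - mean)
        out.append(ssqdm / (nobs - 1) if nobs > 1 else float("nan"))
    return out

small_only = np.random.default_rng(seed=123).random(size=5)
with_large = small_only.copy()
with_large[1] = 10000000

# Last window (c, d, e): online update vs. direct two-pass variance.
online_small = online_rolling_var(small_only, 3)[-1]
online_large = online_rolling_var(with_large, 3)[-1]
exact = np.var(small_only[-3:], ddof=1)
print(online_small, online_large, exact)
```

On the small values the online result matches the direct computation, while with the large value in the history the last window is off even though it contains the same three numbers. The exact wrong value produced by this sketch need not match the one pandas prints. Computing each window from scratch, as rolling(...).apply(pd.Series.std) does in the question, avoids carrying that error along.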