Output of pandas rolling std is not consistent when the data size changes. I applied a rolling std to different slices of the data, all taken from the tail of the same series, and observed that the result for s.tail(3) differs from that for s.tail(4) and s.tail(5):
>>> s = pd.Series(np.random.default_rng(seed=123).random(size=5))
>>> s[1] = 10000000 # just a large number
>>> s
0 6.823519e-01
1 1.000000e+07
2 2.203599e-01
3 1.843718e-01
4 1.759059e-01
dtype: float64
>>> s.tail(3).rolling(window=3, min_periods=1).std()
2 NaN
3 0.025447
4 0.023604
dtype: float64
>>> s.tail(4).rolling(window=3, min_periods=1).std()
1 NaN
2 7.071068e+06
3 5.773503e+06
4 0.000000e+00
dtype: float64
>>> s.tail(5).rolling(window=3, min_periods=1).std()
0 NaN
1 7.071067e+06
2 5.773502e+06
3 5.773503e+06
4 0.000000e+00
dtype: float64
In comparison, if I apply pd.Series.std to the rolling windows, the last value (index 4) is consistent across all three slices:
>>> s.tail(3).rolling(window=3, min_periods=1).apply(pd.Series.std)
2 NaN
3 0.025447
4 0.023604
dtype: float64
>>> s.tail(4).rolling(window=3, min_periods=1).apply(pd.Series.std)
1 NaN
2 7.071068e+06
3 5.773503e+06
4 2.360426e-02
dtype: float64
>>> s.tail(5).rolling(window=3, min_periods=1).apply(pd.Series.std)
0 NaN
1 7.071067e+06
2 5.773502e+06
3 5.773503e+06
4 2.360426e-02
dtype: float64
What causes pandas' rolling standard deviation calculation, with the same window size of 3, to produce different results?
Pandas version: 2.2.2 (also tried 2.3.0) on Windows
Each rolling window ends at the current element. This means that for a Series of 5 items a, b, c, d, e and a window of 3, the computation will be (the numbers below are computed on the unmodified random series, before s[1] is set to 10000000):
std(a) # 0 NaN
std(a, b) # 1 0.444438
std(a, b, c) # 2 0.325633
std(b, c, d) # 3 0.087630
std(c, d, e) # 4 0.023604
If you first select items with tail(3), you'll have:
std(c) # 2 NaN
std(c, d) # 3 0.025447 # this is different
std(c, d, e) # 4 0.023604
And std(c, d) != std(b, c, d).
In short, with a window size of n, the first n-1 items of your output will be affected.
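If you want to see which values each window actually contains, you can iterate over the rolling object. A minimal sketch, reusing the seeded series s from the question:
import numpy as np
import pandas as pd

s = pd.Series(np.random.default_rng(seed=123).random(size=5))
s[1] = 10000000

# Print the values that fall into each window, for the full series vs. tail(3)
for label, ser in [("full", s), ("tail(3)", s.tail(3))]:
    print(label, [list(w) for w in ser.rolling(window=3)])
The window ending at index 3 is [b, c, d] in the full series (it still contains the huge value) but only [c, d] in tail(3), which is exactly where the outputs diverge.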
When you change the second value to 10000000, the intermediate quantities become so large that float64 can no longer represent the large and the small contributions faithfully at the same time, and the small ones are effectively lost (rounded away to zero). This comes from squaring the differences to the mean and accumulating them in one running sum.
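As a rough illustration of the precision limit (plain float64 arithmetic, not pandas internals): float64 carries about 15-16 significant decimal digits, so once a running sum of squared deviations is on the order of 1e14, a term on the order of 1e-4 vanishes entirely:
big = 1e7 ** 2      # ~1e14: squared deviation contributed by the outlier
small = 0.02 ** 2   # ~4e-4: squared deviation of one of the small values
print(big + small == big)   # True -- the small contribution is absorbed completely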
You can observe that the issue mostly disappears if s[1] = 1000000 (one less zero):
s = pd.Series(np.random.default_rng(seed=123).random(size=5))
s[1] = 1000000
s.rolling(window=3, min_periods=1).std()
0 NaN
1 707106.298691
2 577350.008599
3 577350.152354
4 0.021852
dtype: float64
The gap in magnitudes is even more obvious if you compute the variance:
s.rolling(window=3, min_periods=1).var()
0 NaN
1 4.999993e+11
2 3.333330e+11
3 3.333332e+11
4 4.775168e-04 # difference of 15 powers of ten
dtype: float64
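That gap of roughly 15 orders of magnitude is right at the edge of what float64 can resolve, which is why the final value here is merely inaccurate rather than zeroed out. A quick check of the scales involved (eps is float64's relative resolution):
import numpy as np
eps = np.finfo(np.float64).eps   # ~2.22e-16
print(3.3e11 * eps)              # ~7e-05: absolute resolution at the size of the intermediate values
That ~7e-05 absolute resolution is of the same order as the error in the final variance above (4.775168e-04 versus about 5.571e-04 = 0.023604**2 computed directly).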
According to the source code, pandas computes the rolling variance (std^2) online, and the sum of squared deviations from the mean (ssqdm_x) is one of the state variables maintained by this running algorithm. When the rolling window moves past the large value in the example, the remove_var function subtracts a large number, (val - prev_mean) * (val - mean_x[0]), from the accumulated ssqdm_x. However, since that accumulated ssqdm_x is already a large, imprecise number, the subtraction cannot restore the correct small value. Sometimes the ssqdm_x left after the subtraction is still huge, which makes the contribution of the later, much smaller values negligible and yields a (nearly) zero variance.
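Here is a simplified pure-Python sketch of such an add/remove running-variance update. It is loosely modeled on pandas' add_var/remove_var (not the actual Cython code), and the values are copied, rounded, from the question's printout; it reproduces the same failure mode:
def add_var(val, nobs, mean_x, ssqdm_x):
    # Welford-style update when a value enters the window
    nobs += 1
    delta = val - mean_x
    mean_x += delta / nobs
    ssqdm_x += delta * (val - mean_x)   # grows to ~6.7e13 once the outlier is in the window
    return nobs, mean_x, ssqdm_x

def remove_var(val, nobs, mean_x, ssqdm_x):
    # Reverse update when a value leaves the window
    nobs -= 1
    delta = val - mean_x                # val - prev_mean
    mean_x -= delta / nobs
    ssqdm_x -= delta * (val - mean_x)   # subtract a huge, already-rounded number
    return nobs, mean_x, ssqdm_x

vals = [0.6823519, 10000000.0, 0.2203599, 0.1843718, 0.1759059]  # a, b, c, d, e
nobs, mean_x, ssqdm_x = 0, 0.0, 0.0
for v in vals[:3]:                      # fill the first full window: a, b, c
    nobs, mean_x, ssqdm_x = add_var(v, nobs, mean_x, ssqdm_x)
for old, new in [(vals[0], vals[3]), (vals[1], vals[4])]:   # slide to [b, c, d], then [c, d, e]
    nobs, mean_x, ssqdm_x = remove_var(old, nobs, mean_x, ssqdm_x)
    nobs, mean_x, ssqdm_x = add_var(new, nobs, mean_x, ssqdm_x)

print(ssqdm_x / (nobs - 1))             # dominated by rounding error, not the true variance
m = sum(vals[2:]) / 3
print(sum((v - m) ** 2 for v in vals[2:]) / 2)   # ~5.571e-04: the correct variance of c, d, e
After removing the 10000000, the running ssqdm_x is the difference of two nearly equal numbers of size ~1e13, so only rounding error survives, and the later tiny contributions cannot repair it.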