I stumbled upon a weird behaviour in the windowing functionality in pandas: it seems that a rolling sum operation gives different results depending on the length of the series itself.
Given two series:
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=range(5))        # s1 = 0, 1, 2, 3, 4
s2 = pd.Series(np.arange(2, 5), index=range(2, 5))  # s2 = 2, 3, 4
We apply a rolling sum on both:
k = 0.1
r1 = (s1 * k).rolling(2).sum().dropna()  # r1 = 0.1, 0.3, 0.5, 0.7
r2 = (s2 * k).rolling(2).sum().dropna()  # r2 = 0.5, 0.7
# remove values from r1 which are not in r2
r1 = r1[r2.index]  # r1 = 0.5, 0.7
# now r1 should be exactly the same as r2, let's check the indices:
all(r1.index == r2.index)  # => True
However, if we check the values, they are not exactly equal:
r1.iloc[0] == r2.iloc[0]              # => False
abs(r1.iloc[0] - r2.iloc[0]) < 1e-15  # => True
abs(r1.iloc[0] - r2.iloc[0]) < 1e-16  # => False
I am aware that floating point operations are not exact, and I don't think the observed behaviour is a bug.
However, I would assume that the same deterministic calculations are applied within the window(s) of both series, so I would expect the results to be exactly the same.
I am curious as to what exactly is causing this behaviour in the implementation of the window operation.
I think it has to do with numpy.sum rather than with rolling or with the series length.
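To rule out the inputs, note that the values falling into the overlapping windows are exactly identical in both series, so the difference has to come from how each window is summed. A quick check (simply re-creating s1, s2 and k from the question):

import numpy as np
import pandas as pd

k = 0.1
s1 = pd.Series(np.arange(5), index=range(5))
s2 = pd.Series(np.arange(2, 5), index=range(2, 5))

# the scaled values feeding the overlapping windows are bitwise equal,
# so the inputs themselves are not the source of the discrepancy
print(((s1 * k).loc[2:] == (s2 * k)).all())  # --> True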
The numpy.sum documentation notes:

For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result, causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python's math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output.
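To illustrate the documentation's point (this is a generic summation demo, not a reproduction of what pandas does internally): naive element-by-element addition accumulates far more rounding error than np.sum's pairwise summation, while math.fsum returns the exact sum of the stored values.

import math
import numpy as np

# 100,000 copies of 0.1 stored as float32; the exact sum of the stored
# float32 values is just above 10000
values = np.full(100_000, 0.1, dtype=np.float32)

naive = np.float32(0.0)
for v in values:          # add one element at a time, rounding at every step
    naive = naive + v

print(naive)              # drifts visibly away from 10000
print(values.sum())       # pairwise summation: much closer
print(math.fsum(values))  # exact sum of the stored float32 values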
Applying math.fsum through rolling(...).agg(...) sums each window exactly, and the overlapping results then compare equal:

import math
import pandas as pd
import numpy as np

k = .1
s1 = pd.Series(np.arange(5), index=range(5))
s2 = pd.Series(np.arange(2, 5), index=range(2, 5))

# math.fsum computes each window's sum without intermediate rounding error
r1_new = s1.mul(k).rolling(2).agg(math.fsum).dropna()
r2_new = s2.mul(k).rolling(2).agg(math.fsum).dropna()

r1_new.iloc[2:] == r2_new  # --> True, True
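Note that passing a plain Python callable such as math.fsum to agg means pandas evaluates it once per window in Python, so it is noticeably slower than the built-in rolling(...).sum() on long series - the price for an exactly rounded per-window sum.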
In this very specific case, less precision seems to work better - i.e., float32 rather than float64:
k = .1
s1 = pd.Series(np.arange(5), index=range(5))
s2 = pd.Series(np.arange(2, 5), index=range(2, 5))

# cast the scaled values down to float32 before applying the rolling sum
r1_new = s1.mul(k).astype(np.float32).rolling(2).sum().dropna()
r2_new = s2.mul(k).astype(np.float32).rolling(2).sum().dropna()

r1_new.iloc[2:] == r2_new  # --> True, True
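That said, casting to float32 simply discards precision and only happens to line up for these particular values. If the goal is just to treat the two results as equal, a tolerance-based comparison of the original float64 results sidesteps the problem entirely (a small sketch, recomputing r1 and r2 as in the question):

import numpy as np
import pandas as pd

k = 0.1
s1 = pd.Series(np.arange(5), index=range(5))
s2 = pd.Series(np.arange(2, 5), index=range(2, 5))

r1 = (s1 * k).rolling(2).sum().dropna()
r2 = (s2 * k).rolling(2).sum().dropna()
r1 = r1[r2.index]

# element-wise comparison with a tolerance instead of exact equality
print(np.isclose(r1, r2).all())  # --> True

# pandas' own tolerance-aware check, handy in tests (raises on mismatch)
pd.testing.assert_series_equal(r1, r2, check_exact=False)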