
Why do std(skipna=False) and std(skipna=True) yield different results even when there are no NaN or null values in the Series?


I have a pandas Series s. When I call s.std(skipna=True) and s.std(skipna=False), I get different results even though s contains no NaN/null values. Why? Did I misunderstand the skipna parameter? I'm using pandas 1.3.4.

import pandas as pd

s = pd.Series([10.0]*4800000, index=range(4800000), dtype="float32")

# No NaN/null in the Series
print(s.isnull().any()) # False
print(s.isna().any()) # False

# Why do the two calls below print different results?
print(s.std(skipna=False)) # 0.0
print(s.std(skipna=True)) # 0.61053276

Solution

  • This is an issue with Bottleneck, an optional dependency that pandas uses to accelerate some NaN-related routines. pandas only dispatches to Bottleneck when skipna=True, which is why the skipna=False call is unaffected. I think the wrong result comes from loss of precision while calculating the mean of the float32 data: Bottleneck uses naive summation, while NumPy uses the more accurate pairwise summation.

    You can disable Bottleneck with

    pd.set_option('compute.use_bottleneck', False)
    

    to fall back to the NumPy handling.
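To see why the summation strategy matters here, the following is a minimal sketch that compares a naive float32 running sum with NumPy's pairwise np.sum on the same data as in the question. It does not call Bottleneck; np.cumsum is used only as a stand-in for sequential accumulation in float32, and the variable names are illustrative. Above 2**25, adjacent float32 values are 4 apart, so adding 10.0 to a large running total can no longer be represented exactly and the naive total drifts.

```python
import numpy as np

n = 4_800_000
values = np.full(n, 10.0, dtype=np.float32)
exact = 10.0 * n  # reference total, computed in Python's float64

# Sequential accumulation in float32: every partial sum is rounded to
# float32 before the next element is added. The last entry is the total.
# (Stand-in for naive summation; this is not Bottleneck's actual code.)
naive_sum = float(np.cumsum(values)[-1])

# np.sum applies pairwise summation to contiguous float arrays, which keeps
# the partial sums small and the rounding error low.
pairwise_sum = float(values.sum())

print("exact total   :", exact)
print("naive total   :", naive_sum, "-> mean", naive_sum / n)
print("pairwise total:", pairwise_sum, "-> mean", pairwise_sum / n)
```

A mean that is off by this much makes every deviation from it nonzero, so the standard deviation comes out nonzero as well. Besides disabling Bottleneck, casting first (s.astype("float64").std()) should also avoid the problem for this data, since float64 represents all of these partial sums exactly.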