pandas skew

Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?


Background

Suppose there is a list of values that represents a person's activity over several hours. The person did not move at all during those hours, so all the values are 0.

What raised the question?

Searching on Google, I found the following formula for skewness; the same formula appears on several other sites. Its denominator contains the standard deviation (SD). For a list of identical values, whether non-zero (e.g., [1, 1, 1]) or zero (i.e., [0, 0, 0]), the SD is 0, so I expected to get NaN (0 divided by 0) for the skewness. Surprisingly, I get 0 when calling pandas.DataFrame.skew().

[image: skewness formula]

My Question

Why does pandas.DataFrame.skew() return 0 when the SD of a list of values is 0?


Minimal Reproducible Example

import pandas as pd
ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 0],
                           'Day 3': [0, 0, 0, 0, 0, 0]})
print(ot_df.skew(axis=1))
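
For comparison, here is a minimal sketch of what the formula gives when it is evaluated naively with NumPy. The helper name `naive_skew` is made up for illustration, and it uses the plain population form m3 / m2**1.5 without any sample-size correction, so it is an assumption about the formula rather than a copy of what pandas does.

```python
import numpy as np
import pandas as pd


def naive_skew(values):
    """Population skewness m3 / m2**1.5 with no special-casing (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    # Silence the 0/0 warning so the resulting NaN can be inspected directly.
    with np.errstate(invalid="ignore", divide="ignore"):
        return m3 / m2 ** 1.5


constant = [0, 0, 0, 0, 0, 0]
print(naive_skew(constant))        # 0/0 evaluates to nan
print(pd.Series(constant).skew())  # pandas special-cases this and gives 0
```

The naive version produces NaN for constant input, which is the result I expected from pandas as well.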

Note: I have checked several Q&As on this site (e.g., "How does pandas calculate skew?") and elsewhere (e.g., an issue on GitHub), but I did not find an answer to my question.


Solution

  • You can find the implementation here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/nanops.py

    The relevant part of the nanskew function is:

        with np.errstate(invalid="ignore", divide="ignore"):
            result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)
    
        dtype = values.dtype
        if is_float_dtype(dtype):
            result = result.astype(dtype)
    
        if isinstance(result, np.ndarray):
            result = np.where(m2 == 0, 0, result)
            result[count < 3] = np.nan
        else:
            result = 0 if m2 == 0 else result
            if count < 3:
                return np.nan
    

    As you can see, if m2 (which is 0 whenever all the values are constant) is 0, the result is set to 0.

    If you are asking why it is implemented this way, I can only speculate. I suppose it was done for practical reasons: when you calculate skewness, you want to check whether the distribution of the variable is symmetrical (and you can argue that a constant one indeed is: https://stats.stackexchange.com/questions/114823/skewness-of-a-random-variable-that-have-zero-variance-and-zero-third-central-mom).

    EDIT: This behaviour was introduced in response to: https://github.com/pandas-dev/pandas/issues/11974 https://github.com/pandas-dev/pandas/pull/12121

    You could open an issue requesting a flag that controls how this method behaves when the variable is constant. It should be easy to implement.
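
Until such a flag exists, one possible workaround is to mask the result yourself. The sketch below is only an illustration and not part of pandas: `skew_nan_if_constant` is a hypothetical helper that calls `DataFrame.skew` and then replaces the result with NaN wherever a row (or column) holds a single distinct value. The sample data is also made up, with the last row changed so that one result stays defined.

```python
import numpy as np
import pandas as pd


def skew_nan_if_constant(df, axis=0):
    """Like df.skew(axis=axis), but NaN where all values are identical.

    Hypothetical helper for illustration; not a pandas API.
    """
    skewness = df.skew(axis=axis)
    # A row/column is treated as constant if it has at most one distinct value.
    is_constant = df.nunique(axis=axis) <= 1
    # Series.mask replaces the entries where the condition is True.
    return skewness.mask(is_constant, np.nan)


ot_df = pd.DataFrame(data={'Day 1': [0, 0, 0, 0, 0, 0],
                           'Day 2': [0, 0, 0, 0, 0, 1],
                           'Day 3': [0, 0, 0, 0, 0, 5]})
print(skew_nan_if_constant(ot_df, axis=1))
```

Note that this checks for exactly identical values, so nearly constant data affected by floating-point rounding would need a tolerance instead.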