Tags: python, pandas, numpy

pandas value_counts() counts missing values inconsistently


Please consider this simple dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 10]}, index=range(5))

df:
    x
0   1
1   2
2   3
3   4
4  10

Some indices:

ff_idx = [1, 2]

sd_idx = [3, 4]

One way of creating a new column by filtering df based on the above indices:

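# Start from an object column so strings can be assigned without a dtype change
df['ff_sd_indicator'] = pd.Series(np.nan, index=df.index, dtype=object)
# Use .loc rather than chained indexing, which may fail to modify df
df.loc[df.index.isin(ff_idx), 'ff_sd_indicator'] = 'ff_count'
df.loc[df.index.isin(sd_idx), 'ff_sd_indicator'] = 'sd_count'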

Another way of doing the same thing:

df['ff_sd_indicator2'] = np.select([df.index.isin(ff_idx), df.index.isin(sd_idx)], ['ff_count', 'sd_count'], default=np.nan)

Notice that while the values of ff_sd_indicator and ff_sd_indicator2 are naturally the same, the missing values are printed differently (NaN vs nan):

df: 

    x ff_sd_indicator ff_sd_indicator2
0   1             NaN              nan
1   2        ff_count         ff_count
2   3        ff_count         ff_count
3   4        sd_count         sd_count
4  10        sd_count         sd_count

The difference in how they are printed doesn't bother me, but surprisingly the missing values do not show up in the output of:

df['ff_sd_indicator'].value_counts()

which is:

ff_sd_indicator
ff_count    2
sd_count    2

But they do show up in the output of:

df['ff_sd_indicator2'].value_counts()

which is:

ff_sd_indicator2
ff_count    2
sd_count    2
nan         1

So why does value_counts() skip the missing values in ff_sd_indicator but count them in ff_sd_indicator2, when the missing values in both columns were created from the same np.nan?

Edit: output of df.info():

RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   x                 5 non-null      int64 
 1   ff_sd_indicator   5 non-null      object
 2   ff_sd_indicator2  5 non-null      object

Solution

  • By default, value_counts() drops NaN. To keep it, pass dropna=False:

    df['ff_sd_indicator'].value_counts(dropna=False)
    
    ff_sd_indicator
    ff_count    2
    sd_count    2
    NaN         1
    Name: count, dtype: int64
    

    If you check the output of:

    np.select([df.index.isin(ff_idx), df.index.isin(sd_idx)],
              ['ff_count', 'sd_count'], default=np.nan)
    

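    You will see, however, that the result contains the string 'nan' rather than a real NaN. A NumPy array has a single dtype, and because the choices are strings, the np.nan default is converted to a string as well: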

    array(['nan', 'ff_count', 'ff_count', 'sd_count', 'sd_count'],
          dtype='<U32')
    

    Since 'nan' is an ordinary string and not a missing value, value_counts() counts it like any other value instead of dropping it.
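
To make the difference concrete, here is a minimal sketch. The Series names (real_nan, string_nan, fixed) are made up for illustration and are not from the question. The two Series are built by hand to mirror the two columns, and isna() is used to tell a real missing value from the string 'nan'. The last step shows one possible way, among others, to turn the strings back into real missing values with Series.mask.

```python
import numpy as np
import pandas as pd

# Hand-built stand-ins for the two columns in the question:
# one holds a real missing value, the other holds the string 'nan'.
real_nan = pd.Series([np.nan, 'ff_count', 'ff_count', 'sd_count', 'sd_count'], dtype=object)
string_nan = pd.Series(['nan', 'ff_count', 'ff_count', 'sd_count', 'sd_count'], dtype=object)

# isna() only flags real missing values, so it tells the two apart.
n_missing_real = int(real_nan.isna().sum())
n_missing_string = int(string_nan.isna().sum())

# value_counts() drops real NaN by default but counts the string 'nan'.
counted_real = int(real_nan.value_counts().sum())
counted_string = int(string_nan.value_counts().sum())

# One possible fix: Series.mask replaces entries where the condition
# is True with NaN, turning the 'nan' strings into real missing values.
fixed = string_nan.mask(string_nan == 'nan')
counted_fixed = int(fixed.value_counts().sum())

print(n_missing_real, n_missing_string)
print(counted_real, counted_string, counted_fixed)
```

After the mask step, the column behaves like ff_sd_indicator: the missing entry is dropped by default and reappears with dropna=False.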