pandas

Pandas value_counts(sort=False) with large series doesn't work


By default, Series.value_counts sorts the result by count, in descending order:

In [192]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]).value_counts()
Out[192]: 
0    10
2     7
1     4
3     1
dtype: int64

If I pass sort=False, it appears to sort by the value key instead:

In [193]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]).value_counts(sort=False)
Out[193]: 
0    10
1     4
2     7
3     1
dtype: int64

However, when I increase the length of the series, the result goes back to being ordered by count:

In [194]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]*100).value_counts(sort=False)
Out[194]: 
0    1000
2     700
1     400
3     100
dtype: int64

Any ideas what's going on here?


Solution

  • This is the expected behavior. You asked .value_counts() not to sort the result, so it doesn't. With sort=False you get the counts in whatever order the underlying hash table produces them, which is arbitrary, so the by-value order you saw on the short series is not guaranteed. Below I emulate what sort=True actually does, which is simply a sort_values on the unsorted counts:

    In [39]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]).value_counts(sort=False).sort_values(ascending=False)
    Out[39]: 
    0    10
    2     7
    1     4
    3     1
    dtype: int64
    
    In [40]: pd.Series([3,0,2,0,0,1,0,0,0,1,1,0,1,0,2,2,2,2,2,0,0,2]*100).value_counts(sort=False).sort_values(ascending=False)
    Out[40]: 
    0    1000
    2     700
    1     400
    3     100
    dtype: int64
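
If what you actually want is the counts ordered by the value key, the same idea applies: sort explicitly rather than relying on whatever order sort=False happens to return. A minimal sketch, assuming the index values are mutually comparable (the variable names here are only illustrative):

```python
import pandas as pd

s = pd.Series([3, 0, 2, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 2, 2, 2, 2, 2, 0, 0, 2] * 100)

# Count without sorting, then order the result by the value key (the index).
by_value = s.value_counts(sort=False).sort_index()

# Count without sorting, then order by count; this mirrors the default sort=True.
by_count = s.value_counts(sort=False).sort_values(ascending=False)

print(by_value)
print(by_count)
```

Because the ordering comes from sort_index or sort_values, the result should not depend on the length of the series.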