pythonpandasdataframegroup-by

as_index=False groupBy doesn't work with count


When I run groupby on a dataframe with as_index set to False, count seems to not work at all. For example,

import pandas as pd

word_list = ['a', 'b', 'c', 'a', 'c', 'c', 'b', 'a', 'c']

df = pd.DataFrame(word_list, columns=['word'])
as_counts = df.groupby('word', as_index=False).count()
print(as_counts)
  word
0    a
1    b
2    c

If you try the same with .size() as in

as_counts = df.groupby('word', as_index=False).size()

it works as expected

  word  size
0    a     3
1    b     2
2    c     4

What difference between the two functions explains this difference in output?

This post is asking a different question entirely as, in this case, the frame is neither MultiIndex nor seeking to maintain non-unique indices. Equally, none of the answers in that post address the problem of this question.


Solution

  • groupby.count is designed to work on the columns other than the grouping columns while groupby.size is designed to output a single new column (named size), irrespective of the number of existing columns.

    This makes sense since count depends on the NA status, while size is always constant within a group (the number of rows, irrespective of the values).

    This can easily be seen by adding several dummy columns:

    df.assign(dummy1=1, dummy2=None).groupby('word', as_index=False).count()
    
    # as many output columns as the input columns (minus grouper)
    # only the non NA values are counted
      word  dummy1  dummy2
    0    a       3       0
    1    b       2       0
    2    c       4       0
    

    VS:

    df.assign(dummy1=1, dummy2=None).groupby('word', as_index=False).size()
    
    # grouper columns + ONE "size" column
      word  size
    0    a     3
    1    b     2
    2    c     4
    

    This is actually explicitly stated in the documentation:

    DataFrameGroupBy.size()

    Returns:

    DataFrame or Series

    Number of rows in each group as a Series if as_index is True or a DataFrame if as_index is False.

    So back to your original example, groupby.counts outputs nothing since there are no columns to use (the grouping columns are excluded from the aggregation unless specifically sliced from the DataFrameGroupBy object).


    workaround

    As noted elsewhere, the workaround if you want to count one of the grouping columns is to explicitly slice it:

    df.groupby('word', as_index=False)['word'].count()
    
       word
    0     3
    1     2
    2     4
    

    Amusingly, you can slice it more than once and get as many columns in the output:

    df.groupby('word', as_index=False)[['word', 'word']].count()
    
       word  word
    0     3     3
    1     2     2
    2     4     4