python pandas numpy

How do I understand grouped[l].size()?


I am trying to understand what grouped['l'].size() does in the following code:

import numpy as np
import pandas as pd

f = 10   # number of feature columns
n = 250  # number of rows
np.random.seed(100)
x = np.random.randint(0, 2, (n, f))
y = np.random.randint(0, 2, n)
fcols = [f'f{_}' for _ in range(f)]
data = pd.DataFrame(x, columns=fcols)
data['l'] = y
grouped = data.groupby(list(data.columns))
print(grouped['l'].size())

Why does it print this?

f0  f1  f2  f3  f4  f5  f6  f7  f8  f9  l
0   0   0   0   0   0   0   1   1   1   1    1
                        1   0   1   0   0    1
                                        1    1
                                    1   1    1
                    1   0   0   0   0   0    1
                                            ..
1   1   1   1   1   0   0   0   0   0   1    1
                            1   0   0   0    1
                        1   1   0   0   1    1
                    1   1   0   0   0   0    1
                            1   0   1   1    2
Name: l, Length: 239, dtype: int64

I have read the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

This is what confuses me: l is either 0 or 1, so when it is used as the grouping condition, I would expect the output to contain only those two values of l, that is, 2 lines and Length 2 (instead it is Length 239). In my mind, the output above should look like this:

f0  f1  f2  f3  f4  f5  f6  f7  f8  f9  l
0   0   0   0   0   0   0   1   1   1   1    1
                        1   0   1   0   0    2

  1. Why is the length 239, not 2?
  2. Why are many of the f columns empty, with no value?

Solution

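  • Why is the length 239, not 2?

    groupby(...).size() counts the rows in each group, and one group is formed for every distinct combination of values in the grouping columns. (Used the way you use it, it could be replaced with value_counts.)

    In your case, there are 239 unique combinations of all the columns among the 250 rows.

    You are not using l as the grouper here but all columns (list(data.columns)). Selecting ['l'] afterwards only chooses which column to aggregate, not what to group by. To group by l alone, you would need: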

    data.groupby('l').size()
    
    # or
    # data['l'].value_counts()
    
    # l
    # 0    130
    # 1    120
    # dtype: int64
    

  • Why are many of the f columns empty, with no value?

    The "blank" values in the MultiIndex of your output are only a display convention: a blank means the value is the same as in the row above. Since groupby sorts the groups by default, consecutive index entries often share their leading values, so this effect is quite noticeable.

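    If you use reset_index(), you will see that the values are indeed there. Because the Series is itself named l, which clashes with the l index level, you will likely need to pass a different name, for example reset_index(name='count').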
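
Both points can be checked with a short sketch. It rebuilds the data from the question, compares the number of groups with the number of distinct rows, and then flattens the MultiIndex so that every value is visible. The variable names (sizes_all, sizes_l, flat) and the column name "count" are only illustrative choices, not something from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the data from the question.
f, n = 10, 250
np.random.seed(100)
x = np.random.randint(0, 2, (n, f))
y = np.random.randint(0, 2, n)
fcols = [f"f{i}" for i in range(f)]
data = pd.DataFrame(x, columns=fcols)
data["l"] = y

# Grouping by every column gives one group per distinct row.
sizes_all = data.groupby(list(data.columns))["l"].size()
n_unique_rows = len(data.drop_duplicates())
print(len(sizes_all), n_unique_rows)  # the two numbers should match

# Grouping by "l" alone gives one group per distinct value of "l".
sizes_l = data.groupby("l").size()
print(sizes_l)

# reset_index() moves the MultiIndex levels back into ordinary columns,
# so the values that were displayed as blanks become visible.
# name="count" (an illustrative name) avoids a clash between the Series
# name "l" and the index level "l".
flat = sizes_all.reset_index(name="count")
print(flat.head())
```

In flat, every row carries explicit values for f0 to f9 and l, which shows that the blanks in the original printout were only a way of not repeating the value from the row above.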