pandas, pandas-groupby

What are the 25%, 50%, and 75% values when we describe a grouped DataFrame?


I am going through the pandas groupby docs. When I group a DataFrame and call describe on the result, as below:

df:

     A      B         C         D
0  foo    one -0.987674  0.039616
1  bar    one -0.653247 -1.022529
2  foo    two  0.404201  1.308777
3  bar  three  1.620780  0.574377
4  foo    two  1.661942  0.579888
5  bar    two  0.747878  0.463052
6  foo    one  0.070278  0.202564
7  foo  three  0.779684 -0.547192

grouped = df.groupby(['A', 'B'])
grouped.describe()

gives

              C                      ...         D                    
          count      mean       std  ...       50%       75%       max
A   B                                ...                              
bar one     1.0  0.224944       NaN  ...  1.107509  1.107509  1.107509
    three   1.0  0.704943       NaN  ...  1.833098  1.833098  1.833098
    two     1.0 -0.091613       NaN  ... -0.549254 -0.549254 -0.549254
foo one     2.0  0.282298  1.554401  ... -0.334058  0.046640  0.427338
    three   1.0  1.688601       NaN  ... -1.457338 -1.457338 -1.457338
    two     2.0  1.206690  0.917140  ... -0.096405  0.039241  0.174888

What do 25%, 50%, and 75% signify in the described output? A bit of explanation, please.


Solution

  • You can check the documentation for DataFrameGroupBy.describe:

    Notes:

    For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.


    Can you explain the foo-one values in the above example?

    It is called a MultiIndex:

    Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

    grouped=df.groupby(['A', 'B'])
    df = grouped.describe()
    
    print (df.index)
    MultiIndex([('bar',   'one'),
                ('bar', 'three'),
                ('bar',   'two'),
                ('foo',   'one'),
                ('foo', 'three'),
                ('foo',   'two')],
               names=['A', 'B'])
    
    print (df.columns)
    MultiIndex([('C', 'count'),
                ('C',  'mean'),
                ('C',   'std'),
                ('C',   'min'),
                ('C',   '25%'),
                ('C',   '50%'),
                ('C',   '75%'),
                ('C',   'max'),
                ('D', 'count'),
                ('D',  'mean'),
                ('D',   'std'),
                ('D',   'min'),
                ('D',   '25%'),
                ('D',   '50%'),
                ('D',   '75%'),
                ('D',   'max')],
               )
    
    print (df.loc[('foo','one'), ('C', '75%')])
    -0.19421
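
To see where -0.19421 comes from, here is a small sketch that rebuilds the question's df by hand and recomputes the foo/one percentiles for column C. The variable names (foo_one_c, by_hand, and so on) are only illustrative, and the arithmetic assumes pandas' default linear interpolation between the sorted values of the group.

```python
import pandas as pd

# The question's data, typed in from the df shown at the top.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [-0.987674, -0.653247, 0.404201, 1.620780,
          1.661942, 0.747878, 0.070278, 0.779684],
    "D": [0.039616, -1.022529, 1.308777, 0.574377,
          0.579888, 0.463052, 0.202564, -0.547192],
})

# The (foo, one) group is rows 0 and 6, so column C has just two values.
foo_one_c = df.loc[(df["A"] == "foo") & (df["B"] == "one"), "C"]
lo, hi = foo_one_c.min(), foo_one_c.max()

# With two values, the q-th percentile sits a fraction q of the way from lo to hi.
by_hand = {q: lo + q * (hi - lo) for q in (0.25, 0.50, 0.75)}

# Series.quantile and groupby(...).describe() should agree with the hand calculation.
by_quantile = foo_one_c.quantile([0.25, 0.50, 0.75])
described = df.groupby(["A", "B"]).describe()
p75 = float(described.loc[("foo", "one"), ("C", "75%")])
p50 = float(described.loc[("foo", "one"), ("C", "50%")])

print(by_hand)
print(by_quantile)
print(round(p75, 5))                            # -0.19421
print(abs(p50 - foo_one_c.median()) < 1e-12)    # 50% is the median
```

So for each group, 25% is the value below which a quarter of that group's values fall, 50% is the median, and 75% is the value below which three quarters fall. Groups with a single row (for example bar/one) show the same number for min, 25%, 50%, 75%, and max, and NaN for std. If other cut points are needed, describe takes a percentiles argument, for example grouped.describe(percentiles=[0.1, 0.9]).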