python pandas aggregate

Python Pandas: Is Order Preserved When Using groupby() and agg()?


I've frequently used pandas' agg() function to run summary statistics on every column of a DataFrame. For example, here's how you would produce the mean and standard deviation:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

>>> df
[output]
        A   B    C
0  group1  10  100
1  group1  12  102
2  group2  10  100
3  group2  25  250
4  group3  10  100
5  group3  12  102
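
The mean and standard deviation call mentioned above is not shown in the post. A minimal sketch of what it might look like, assuming the string aliases 'mean' and 'std' (the original may well have passed np.mean and np.std instead):

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Mean and sample standard deviation of every non-grouping column, per group.
# The result has a two-level column index: (column, statistic).
summary = df.groupby('A').agg(['mean', 'std'])
print(summary)
```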

For both of those statistics, the order in which individual rows are passed to the aggregating function does not matter. But consider the following example, which returns the mean along with the second row of each group:

df.groupby('A').agg([np.mean, lambda x: x.iloc[1] ])

[output]

           B             C         
        mean <lambda> mean <lambda>
A                                  
group1  11.0       12  101      102
group2  17.5       25  175      250
group3  11.0       12  101      102

In this case the lambda functions as intended, outputting the second row in each group. However, I have not been able to find anything in the pandas documentation that implies that this is guaranteed to be true in all cases. I want to use agg() along with a weighted average function, so I want to be sure that the rows that come into the function will be in the same order as they appear in the original DataFrame.

Does anyone know, ideally via somewhere in the docs or pandas source code, if this is guaranteed to be the case?
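
For context, here is one way the weighted average mentioned above could be written. This is only a sketch under assumptions: it treats column 'C' as the weights for column 'B' (the post does not say which columns are involved), and the helper name weighted_mean is made up for illustration. Because the values and weights are taken from the same per-group sub-frame, they stay aligned row by row regardless of the order in which the rows arrive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

def weighted_mean(group):
    # Hypothetical helper: 'B' holds the values, 'C' the weights (an assumption).
    return np.average(group['B'], weights=group['C'])

# Select the two columns explicitly so the function receives only values and weights.
result = df.groupby('A')[['B', 'C']].apply(weighted_mean)
print(result)
```

Note that this uses apply() on a two-column selection rather than agg(), since agg() hands the function one column at a time.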


Solution

  • See this enhancement issue

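    The short answer is yes: groupby preserves the order of the rows within each group as they were passed in. You can check this by reversing the row order of your example, after which x.iloc[1] picks the row that originally came first in each group: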

    In [20]: df.sort_index(ascending=False).groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
    Out[20]: 
               B             C         
            mean <lambda> mean <lambda>
    A                                  
    group1  11.0       10  101      100
    group2  17.5       10  175      100
    group3  11.0       10  101      100
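
The same check can be made more direct by recording the row labels each group receives. The sketch below is an illustration rather than a proof: it reorders the question's frame with an arbitrary fixed permutation (chosen here purely for the example) and compares the labels seen by the aggregating function with the order in which those rows appear in the reordered frame.

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Arbitrary fixed reordering of the rows (labels 3, 1, 4, 0, 5, 2).
shuffled = df.iloc[[3, 1, 4, 0, 5, 2]]

# Row labels in the order the aggregating function receives them, per group.
seen = shuffled.groupby('A')['B'].agg(lambda s: list(s.index))

# Row labels per group in the order they appear in the reordered frame.
expected = {key: shuffled.index[shuffled['A'] == key].tolist() for key in seen.index}

for key in seen.index:
    print(key, seen[key], expected[key])
```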
    

    This is NOT true for resample, however, because it requires a monotonic index (it WILL work with a non-monotonic index, but will sort it first).

    There is a sort= flag to groupby, but this relates to the sorting of the groups themselves and not the observations within a group.
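
A small sketch of what the sort= flag does and does not affect. The frame and its column names ('key', 'val') are made up for this illustration:

```python
import pandas as pd

# Toy frame in which group 'b' appears before group 'a'.
df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'val': [1, 2, 3, 4]})

# sort=True (the default) orders the groups by key.
sorted_groups = df.groupby('key', sort=True)['val'].agg(list)

# sort=False keeps the groups in order of first appearance.
unsorted_groups = df.groupby('key', sort=False)['val'].agg(list)

print(sorted_groups)
print(unsorted_groups)
```

In both results the values inside each group keep their original relative order; only the order of the groups changes.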

    FYI: df.groupby('A').nth(1) is a safe way to get the 2nd value of a group (your method above will fail if a group has fewer than 2 elements).
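
To illustrate the point about short groups, here is a sketch using a made-up frame in which one group has a single row. It shows nth(1) skipping that group, plus a guarded variant of the lambda that returns NaN instead of raising. The guard is an assumption about what behaviour is wanted, not something taken from the post.

```python
import numpy as np
import pandas as pd

# 'group2' has only one row, so it has no second element.
df = pd.DataFrame({'A': ['group1', 'group1', 'group2'],
                   'B': [10, 12, 10]})

# nth(1) returns the second row of each group and simply omits groups that are too short.
second = df.groupby('A').nth(1)

# Guarded lambda: fall back to NaN when the group has fewer than 2 rows.
guarded = df.groupby('A')['B'].agg(lambda x: x.iloc[1] if len(x) > 1 else np.nan)

print(second)
print(guarded)
```

The shape and index of the nth(1) result differ between pandas versions, so the sketch only relies on the 'B' column of that result.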