pythonpandasdataframegroup-bypandas-resample

Pandas - How to merge rows based on a criteria


I have a Pandas dataframe with thousands of rows that I want to combine to smaller number of rows.

In the new dataframe, I want:

my search shows I might use agg function to achieve the above piece (not too sure). But hardest part is to write the criteria with Pandas that select the number of rows needed to merged. Here is an example:

df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 4, 2],
                   'col2': [10, 20, 30, 40, 50, 60, 70],
                   'col3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                   'col4': [2, 1, 4, 5, 8, 10, 3]})

My criteria for the number of rows to be merged is so that summation of column 1 values for them to be above a certain threshold. Assume it's 3 here:

Final dataframe would look like this:

Columns 1 and 2 are sum of grouped rows. Column 3 is average of grouped rows. Column 3 is weighted average of grouped rows with weight factor being column 1.

newdf = pd.DataFrame({'col1': [4, 5, 4, 2],
                   'col2': [60, 90, 60, 70],
                   'col3': [0.2, 0.45, 0.6, 0.7],
                   'col4': [2.75, 6.8, 10, 3]})

Can someone help me?

I looked into something like this but two issues remained: the grouping criteria doesn't work, and not sure how the weighted average

agg_funcs = {'col1': 'sum', 'col2': 'sum','col3': 'mean', 'col4': lambda x: (x['col1'] * x['col4']).sum() / x['col1'].sum()}

grouped_df = df.groupby('col1').agg(agg_funcs)

The code failed and complained about the weighted average piece, but the grouping criteria is not correct too!


Solution

  • You need to change two things:

    def threshold_grouper(s, thresh=3):
        group = []
        i = 0
        total = 0
        for val in s:
            total += val
            group.append(i)
            if total>=thresh:
                i+=1
                total=0
        return group
    
    (df.eval('col4 = col4*col1')
       .groupby(threshold_grouper(df['col1']))
       .agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean', 'col4': 'sum'})
       .eval('col4 = col4/col1')
    )
    

    You can also cheat and use a side effect in agg, accessing "col1" externally:

    (df.groupby(threshold_grouper(df['col1']))
       .agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean',
             'col4': lambda g: np.average(g, weights=df['col1'].reindex_like(g))})
    )
    

    Output:

       col1  col2  col3   col4
    0     4    60  0.20   2.75
    1     5    90  0.45   6.80
    2     4    60  0.60  10.00
    3     2    70  0.70   3.00