pythonpandasitertools-groupby

List of tuples: combine two dataframe columns if first tuple elements match


I have two separate lists each of n elements, with one being ID numbers and the second being pandas dataframes. I will define them as id and dfs. The dataframes in dfs have the same format with columns A, B, and C but with different numerical values. I have zipped the two lists together as such:

df_groups = list(zip(id, dfs))

With this list I am trying to locate any instances where id is the same and then add columns A and B together for those dataframes and merge into one dataframe. For an example, I will use the following:

id = ['a','b','c','a','d']

The corresponding dataframes I have may look as such:

dfs[0]
A  B  C
0  0  1
0  0  1

dfs[1]
A  B  C
0  1  1
0  1  1

dfs[2]
A  B  C
1  1  1
1  2  1

dfs[3]
A  B  C
5  6  1
11 8  1

dfs[4]
A  B  C
3  5  2
3  18 2

Then, as can be seen above, id[0] is the same as id[3]. As such, I want to create a new list of tuples such that dfs[0]['A'] and dfs[3]['A'] are added together (similarly also for column B and the duplicate id value is dropped.

Thus, it should look like this:

id = ['a','b','c','d']

dfs[0]
A  B  C
5  6  1
11 8  1

dfs[1]
A  B  C
0  1  1
0  1  1

dfs[2]
A  B  C
1  1  1
1  2  1

dfs[3]
A  B  C
3  5  2
3  18 2

The following worked for removing the duplicate values of id but I am not quite sure how to go about the column operations on dfs. I will of course need to add the columns A and B first before running the below:

from itertools import groupby

df_groups_b = ([next(b) for a, b in groupby(df_groups, lambda x: x[0])])

Any assistance would be much appreciated, thank you!

Edit: to clarify, column C from the original dataframe would be retained as is. In the case where the first tuple elements match, column C from the corresponding dataframes will be identical.


Solution

  • You can write a custom summarising function to go through all dataframes in a group and return a sum.

    I don't entirely love this solution, because in converts C to float, but you can play with it further if needed:

    from itertools import groupby
    import pandas as pd
    
    ids = ['a','b','c','a','d']
    dfs = [
        pd.DataFrame({"A": [0], "B": [0], "C": [0]}),
        pd.DataFrame({"A": [1], "B": [1], "C": [1]}),
        pd.DataFrame({"A": [2], "B": [2], "C": [2]}),
        pd.DataFrame({"A": [3], "B": [3], "C": [3]}),
        pd.DataFrame({"A": [4], "B": [4], "C": [4]})
    ]
    df_groups = list(zip(ids, dfs))
    df_groups = sorted(df_groups, key=lambda x: x[0])
    
    def summarise_cols(group, cols_to_summarise=["A", "B"]):
        _, df = next(group)
        for _, next_df in group:
            df = df.add(next_df[cols_to_summarise], fill_value=0)
        return df
    
    df_groups_b = ([summarise_cols(group)
                    for _, group in groupby(df_groups, lambda x: x[0])])
    
    for d in df_groups_b:
        print(d)
    

    Output:

       A  B    C
    0  3  3  0.0
       A  B  C
    0  1  1  1
       A  B  C
    0  2  2  2
       A  B  C
    0  4  4  4
    

    UPD: just noticed your update

    Edit: to clarify, column C from the original dataframe would be retained as is. In the case where the first tuple elements match, column C from the corresponding dataframes will be identical.

    You can then do this instead and the column types will be unchanged

    def summarise_cols(group, cols_to_keep=["C"]):
        _, df = next(group)
        for _, next_df in group:
            df += next_df
            df[cols_to_keep] = next_df[cols_to_keep]
        return df
    

    UPD2: Returning group_id and unified dataframe together as a tuple.

    def summarise_cols(group, cols_to_keep=["C"]):
        group_id, df = next(group)
        for _, next_df in group:
            df += next_df
            df[cols_to_keep] = next_df[cols_to_keep]
        return group_id, df