I have two separate lists each of n
elements, with one being ID numbers and the second being pandas dataframes. I will define them as id
and dfs
. The dataframes in dfs
have the same format with columns A
, B
, and C
but with different numerical values. I have zipped the two lists together as such:
df_groups = list(zip(id, dfs))
With this list I am trying to locate any instances where id
is the same and then add columns A
and B
together for those dataframes and merge into one dataframe. For an example, I will use the following:
id = ['a','b','c','a','d']
The corresponding dataframes I have may look as such:
dfs[0]
A B C
0 0 1
0 0 1
dfs[1]
A B C
0 1 1
0 1 1
dfs[2]
A B C
1 1 1
1 2 1
dfs[3]
A B C
5 6 1
11 8 1
dfs[4]
A B C
3 5 2
3 18 2
Then, as can be seen above, id[0]
is the same as id[3]
. As such, I want to create a new list of tuples such that dfs[0]['A']
and dfs[3]['A']
are added together (similarly also for column B
and the duplicate id
value is dropped.
Thus, it should look like this:
id = ['a','b','c','d']
dfs[0]
A B C
5 6 1
11 8 1
dfs[1]
A B C
0 1 1
0 1 1
dfs[2]
A B C
1 1 1
1 2 1
dfs[3]
A B C
3 5 2
3 18 2
The following worked for removing the duplicate values of id
but I am not quite sure how to go about the column operations on dfs
. I will of course need to add the columns A
and B
first before running the below:
from itertools import groupby
df_groups_b = ([next(b) for a, b in groupby(df_groups, lambda x: x[0])])
Any assistance would be much appreciated, thank you!
Edit: to clarify, column C from the original dataframe would be retained as is. In the case where the first tuple elements match, column C from the corresponding dataframes will be identical.
You can write a custom summarising function to go through all dataframes in a group and return a sum.
I don't entirely love this solution, because in converts C
to float, but you can play with it further if needed:
from itertools import groupby
import pandas as pd
ids = ['a','b','c','a','d']
dfs = [
pd.DataFrame({"A": [0], "B": [0], "C": [0]}),
pd.DataFrame({"A": [1], "B": [1], "C": [1]}),
pd.DataFrame({"A": [2], "B": [2], "C": [2]}),
pd.DataFrame({"A": [3], "B": [3], "C": [3]}),
pd.DataFrame({"A": [4], "B": [4], "C": [4]})
]
df_groups = list(zip(ids, dfs))
df_groups = sorted(df_groups, key=lambda x: x[0])
def summarise_cols(group, cols_to_summarise=["A", "B"]):
_, df = next(group)
for _, next_df in group:
df = df.add(next_df[cols_to_summarise], fill_value=0)
return df
df_groups_b = ([summarise_cols(group)
for _, group in groupby(df_groups, lambda x: x[0])])
for d in df_groups_b:
print(d)
Output:
A B C
0 3 3 0.0
A B C
0 1 1 1
A B C
0 2 2 2
A B C
0 4 4 4
UPD: just noticed your update
Edit: to clarify, column C from the original dataframe would be retained as is. In the case where the first tuple elements match, column C from the corresponding dataframes will be identical.
You can then do this instead and the column types will be unchanged
def summarise_cols(group, cols_to_keep=["C"]):
_, df = next(group)
for _, next_df in group:
df += next_df
df[cols_to_keep] = next_df[cols_to_keep]
return df
UPD2: Returning group_id and unified dataframe together as a tuple.
def summarise_cols(group, cols_to_keep=["C"]):
group_id, df = next(group)
for _, next_df in group:
df += next_df
df[cols_to_keep] = next_df[cols_to_keep]
return group_id, df