I want to assemble a dataframe from three separate dfs, each of which have the contents of a column. The trick is I want each row of the new df to be dynamically created based on a year offset.
For example, my background theory is that there is a time lag that flows each year from Funding -> Staff -> Output, and I want to assemble the story of each row. Each row should start with a year N's Output, then have year N-2's Funding, then year N-1's Staff.
Desired final combined df
Name | ID | Year_Output | Output | Year_Funding | Funding | Year_Staff | Staff |
---|---|---|---|---|---|---|---|
ABC | 1 | 2023 | 25 | 2021 | 500 | 2022 | 18 |
DEF | 2 | 2022 | 51 | 2020 | 1000 | 2021 | 12 |
Example Output
df (others are formatted the same)
Name | ID | Year | Output |
---|---|---|---|
ABC | 1 | 2023 | 25 |
DEF | 2 | 2022 | 51 |
What is the best way to assemble this programmatically in pandas? I've tried creating a new year column for Year_Funding
that keys off the Year_Output
(always N-2), but then how do I merge the specific ID-Year
cell into the correct place?
MWE
import pandas as pd
d = {'Name': ['ABC', 'DEF'],
'ID': [1, 2],
'Year': [2023, 2022],
'Output': [25, 51]}
output_df = pd.DataFrame(data=d)
e = {'Name': ['ABC', 'DEF', 'ABC'],
'ID': [1, 2, 1],
'Year': [2021, 2020, 2023],
'Funding': [500, 1000, 17]}
funding_df = pd.DataFrame(data=e)
combined_df = (?)
edit: df e
had Output as column name, changed to Funding
edit 2: clarifying that each separate df can have multiple years of data for an ID
I was able to get this to work using .apply
def get_Funding_info(row):
return funding_df.loc[( (funding_df['ID']==row['ID']) & (funding_df['Year']==row['Year_Funding']) ), 'Funding'].squeeze()
combined_df['Funding'] = combined_df.apply(get_Funding_info, axis=1)