pythonpandasdataframe

Assemble dataframe from separate dfs, selecting particular cells


I want to assemble a dataframe from three separate dfs, each of which have the contents of a column. The trick is I want each row of the new df to be dynamically created based on a year offset.

For example, my background theory is that there is a time lag that flows each year from Funding -> Staff -> Output, and I want to assemble the story of each row. Each row should start with a year N's Output, then have year N-2's Funding, then year N-1's Staff.

Desired final combined df

Name ID Year_Output Output Year_Funding Funding Year_Staff Staff
ABC 1 2023 25 2021 500 2022 18
DEF 2 2022 51 2020 1000 2021 12

Example Output df (others are formatted the same)

Name ID Year Output
ABC 1 2023 25
DEF 2 2022 51

What is the best way to assemble this programmatically in pandas? I've tried creating a new year column for Year_Funding that keys off the Year_Output (always N-2), but then how do I merge the specific ID-Year cell into the correct place?

MWE

import pandas as pd

d = {'Name': ['ABC', 'DEF'],
     'ID': [1, 2],
     'Year': [2023, 2022],
     'Output': [25, 51]}
output_df = pd.DataFrame(data=d)

e = {'Name': ['ABC', 'DEF', 'ABC'],
     'ID': [1, 2, 1],
     'Year': [2021, 2020, 2023],
     'Funding': [500, 1000, 17]}
funding_df = pd.DataFrame(data=e)


combined_df = (?)

edit: df e had Output as column name, changed to Funding

edit 2: clarifying that each separate df can have multiple years of data for an ID


Solution

  • I was able to get this to work using .apply

    def get_Funding_info(row):
        return funding_df.loc[( (funding_df['ID']==row['ID']) & (funding_df['Year']==row['Year_Funding']) ), 'Funding'].squeeze()
    
    combined_df['Funding'] = combined_df.apply(get_Funding_info, axis=1)