pythonpandas

Using a column value to find the Column header name in Pandas


Scenario: I have a pandas dataframe. I am trying to use the values in a given column (year) to find the relevant header name and add it to a new column (year_name). For example, if the dataframe looks like this:

itemName 2020 2021 2022 2023 2024 year
item1 5 20 10 10 50 3
item2 10 10 50 20 40 2
item3 12 35 73 10 54 4

The result should be like this:

itemName 2020 2021 2022 2023 2024 year year_name
item1 5 20 10 10 50 3 2022
item2 10 10 50 20 40 2 2021
item3 12 35 73 10 54 4 2023

Obs. the itemName column is the index.

Issue: I am trying to use a lambda function to use the value of each row of "year" and use it to find the column name for that row and add it to the year_name column.

Function: I tried:

col_names = result_dict[col].columns.tolist()
result_df[[last_year_header']] = result_df[[_last_year']].apply(lambda x: col_names[x])

but this gave me the following error:

 TypeError: list indices must be integers or slices, not Series

I also tried:

col_names = result_dict[col].columns.tolist()
result_df[[last_year_header']] = result_df[[_last_year']].apply(lambda x: col_names[x.iloc[0].astype(int)])

But this gave me:

 IndexError: list index out of range

Question: I am clearly missing something with the implementation of the lambda function in this case. How can I fix this?


Solution

  • You don't need a lmabda, you should be able to directly index your columns index:

    df['year_name'] = df.columns[df['year']-1] #.astype('Int64')
    

    Output:

              2020  2021  2022  2023  2024  year year_name
    itemName                                              
    item1        5    20    10    10    50     3      2022
    item2       10    10    50    20    40     2      2021
    item3       12    35    73    10    54     4      2023
    

    If there can be invalid values in df['year'], you could use a Series with reindex:

    df['year_name'] = pd.Series(df.columns).reindex(df['year']-1).values
    

    Example output:

              2020  2021  2022  2023  2024  year year_name
    itemName                                              
    item1        5    20    10    10    50   3.0      2022
    item2       10    10    50    20    40  20.0       NaN
    item3       12    35    73    10    54   NaN       NaN
    

    Reproducible inputs:

    # Example 1 (valid values)
    df = pd.DataFrame.from_dict({
            'index': ['item1', 'item2', 'item3'],
            'columns': [2020, 2021, 2022, 2023, 2024, 'year'],
            'data': [[5, 20, 10, 10, 50, 3],
                     [10, 10, 50, 20, 40, 2],
                     [12, 35, 73, 10, 54, 4],],
            'index_names': ['itemName'],
            'column_names': [None],
        }, 'tight')
    
    # Example 2 (invalid values)
    df = pd.DataFrame.from_dict({
            'index': ['item1', 'item2', 'item3'],
            'columns': [2020, 2021, 2022, 2023, 2024, 'year'],
            'data': [[5, 20, 10, 10, 50, 3],
                     [10, 10, 50, 20, 40, 20],
                     [12, 35, 73, 10, 54, None],],
            'index_names': ['itemName'],
            'column_names': [None],
        }, 'tight')