pythonpandasdataframemultidimensional-array

Adding a multi-dimensional column to a pandas dataframe


I have two dataframes which look something like the following:

issue_df

    COL1  COL2                 REF
0  Name1  Val1        [REF1, REF2]
1  Name2  Val2                None
2  Name3  Val3  [REF1, REF2, REF3]
3  Name4  Val4                REF2

ref_df

   NAME  SCORE
0  REF1  100.0
1  REF2   99.0
2  REF3   85.0

I want to iterate through the rows of issue_df and query the ref_df for the values in the REF column and create a new merged data frame like the following output:

merged_df

    COL1  COL2                 REF  REF_INFO
0  Name1  Val1        [REF1, REF2]  REF1 : 100.0
                                    REF2 : 99.0
1  Name2  Val2                None  None
2  Name3  Val3  [REF1, REF2, REF3]  REF1 : 100.0
                                    REF2 : 99.0
                                    REF3 : 85.0
3  Name4  Val4                REF2  REF2 : 99.0

Then ideally I would be able to query this merged_df for any row where it has a ref where the score is less than a certain amount.

I've tried creating this as a dictionary first, where REF_INFO includes a column of nested dictionaries and then converting it to a dataframe, but this produces errors and warnings.

Test code I'm using:

import pandas as pd

issue_data = {
    'COL1': ['Name1', 'Name2', 'Name3', 'Name4'],
    'COL2': ['Val1', 'Val2', 'Val3', 'Val4'],
    'REF': [['REF1', 'REF2'], None, ['REF1', 'REF2', 'REF3'], 'REF2']}
ref_data = {
    'NAME': ['REF1', 'REF2', 'REF3'],
    'SCORE': [100.0, 99.0, 85.0]}

issue_df = pd.DataFrame(issue_data)
ref_df = pd.DataFrame(ref_data)

Any ideas on how to accomplish what I'm going for?


Solution

  • You can explode on column 'REF' of issue_df, then map the values from ref_df:

    merged_df = issue_df.explode("REF")
    merged_df["REF_INFO"] = merged_df["REF"].map(ref_df.set_index("NAME")["SCORE"])
    
        COL1  COL2   REF  REF_INFO
    0  Name1  Val1  REF1     100.0
    0  Name1  Val1  REF2      99.0
    1  Name2  Val2     -       NaN
    2  Name3  Val3  REF1     100.0
    2  Name3  Val3  REF2      99.0
    2  Name3  Val3  REF3      85.0
    3  Name4  Val4  REF2      99.0
    

    Then query merged_df to get rows less than a certain value:

    merged_df.query("REF_INFO < 100")
    
        COL1  COL2   REF  REF_INFO
    0  Name1  Val1  REF2      99.0
    2  Name3  Val3  REF2      99.0
    2  Name3  Val3  REF3      85.0
    3  Name4  Val4  REF2      99.0
    

    If the query should include other 'REF_INFO' values for the same original row, you can instead get the index of the items that match the query and use that to filter merged_df:

    f = merged_df.query("REF_INFO < 90").index.values
    q = merged_df[merged_df.index.isin(f)]
    
        COL1  COL2   REF  REF_INFO
    2  Name3  Val3  REF1     100.0
    2  Name3  Val3  REF2      99.0
    2  Name3  Val3  REF3      85.0