python pandas dataframe multidimensional-array

Adding a multi-dimensional column to a pandas dataframe

I have two dataframes which look something like the following:

issue_df

    COL1  COL2                 REF
0  Name1  Val1        [REF1, REF2]
1  Name2  Val2                None
2  Name3  Val3  [REF1, REF2, REF3]
3  Name4  Val4                REF2

ref_df

   NAME  SCORE
0  REF1  100.0
1  REF2   99.0
2  REF3   85.0

I want to iterate through the rows of issue_df and query the ref_df for the values in the REF column and create a new merged data frame like the following output:

merged_df

    COL1  COL2                 REF  REF_INFO
0  Name1  Val1        [REF1, REF2]  REF1 : 100.0
                                    REF2 : 99.0
1  Name2  Val2                None  None
2  Name3  Val3  [REF1, REF2, REF3]  REF1 : 100.0
                                    REF2 : 99.0
                                    REF3 : 85.0
3  Name4  Val4                REF2  REF2 : 99.0

Then ideally I would be able to query this merged_df for any row where it has a ref where the score is less than a certain amount.

I've tried creating this as a dictionary first, where REF_INFO includes a column of nested dictionaries and then converting it to a dataframe, but this produces errors and warnings.

Test code I'm using:

import pandas as pd

issue_data = {
    'COL1': ['Name1', 'Name2', 'Name3', 'Name4'],
    'COL2': ['Val1', 'Val2', 'Val3', 'Val4'],
    'REF': [['REF1', 'REF2'], None, ['REF1', 'REF2', 'REF3'], 'REF2']}
ref_data = {
    'NAME': ['REF1', 'REF2', 'REF3'],
    'SCORE': [100.0, 99.0, 85.0]}

issue_df = pd.DataFrame(issue_data)
ref_df = pd.DataFrame(ref_data)

Any ideas on how to accomplish what I'm going for?

Solution

You can explode on column 'REF' of issue_df, then map the values from ref_df:

merged_df = issue_df.explode("REF")
merged_df["REF_INFO"] = merged_df["REF"].map(ref_df.set_index("NAME")["SCORE"])

    COL1  COL2   REF  REF_INFO
0  Name1  Val1  REF1     100.0
0  Name1  Val1  REF2      99.0
1  Name2  Val2     -       NaN
2  Name3  Val3  REF1     100.0
2  Name3  Val3  REF2      99.0
2  Name3  Val3  REF3      85.0
3  Name4  Val4  REF2      99.0

Then query merged_df to get rows less than a certain value:

merged_df.query("REF_INFO < 100")

    COL1  COL2   REF  REF_INFO
0  Name1  Val1  REF2      99.0
2  Name3  Val3  REF2      99.0
2  Name3  Val3  REF3      85.0
3  Name4  Val4  REF2      99.0

If the query should include other 'REF_INFO' values for the same original row, you can instead get the index of the items that match the query and use that to filter merged_df:

f = merged_df.query("REF_INFO < 90").index.values
q = merged_df[merged_df.index.isin(f)]

    COL1  COL2   REF  REF_INFO
2  Name3  Val3  REF1     100.0
2  Name3  Val3  REF2      99.0
2  Name3  Val3  REF3      85.0