pythonpandasdataframemergepandas-merge

Python: Selecting column values from multiple columns dynamically


I have 2 dataframes. The first is a summary table that summarizes the accuracy (in descending order) of each industry and its source.

cols = ['industry', 'source', 'accuracy']

df = pd.DataFrame(np.array([
    ['chemical', 'source B', 0.9],
    ['chemical', 'source A', 0.7],
    ['education', 'source A', 0.9],
]), columns=cols)

In the 2nd table, Source A and B have lists of strings in them, and they can be nulls:

cols = ['company', 'industry', 'source A', 'source B']

df2 = pd.DataFrame(np.array([
    ['company1', 'chemical', np.nan, ['a123', 'b456']],
    ['company2', 'chemical', ['a555', 'd333'], np.nan],
    ['company3', 'education', np.nan, ['777', '888']],
]), columns=cols)

For each row/company, I'm supposed to select the first non-null source that has the highest accuracy, which will look something like the following table:

cols = ['company', 'industry', 'which_source', 'source_value']

df3 = pd.DataFrame(np.array([
    ['company1', 'chemical', 'source B', ['a123', 'b456']],
    ['company2', 'chemical', 'source A', ['a555', 'd333']],
    ['company3', 'education', np.nan, np.nan],
]), columns=cols)

For e.g., for company1 and 2, although they're both from the 'chemical' industry, for company2 its source is from source A because its value in source B is null.

And for company3 from 'education' industry, even though there is a value in source B, as source B for 'education' industry doesn't meet some minimum threshold (hence it didn't appear in the df1), it's 'source' and 'source_value' should just be null.

Thanks in advance!


Solution

  • You could melt, merge and filter:

    df3 = (df2
     .melt(['company', 'industry'], var_name='source', value_name='source_value')
     .merge(df, how='inner')
     .sort_values(by='source_value', key=pd.isna)
     .groupby(['company', 'industry'], as_index=False).first()
     .assign(which_source=lambda d: d['source'].mask(d['source_value'].isna()))
     .drop(columns=['source', 'accuracy'])
    )
    

    output:

        company   industry  source_value which_source
    0  company1   chemical  [a123, b456]     source B
    1  company2   chemical  [a555, d333]     source A
    2  company3  education          None          NaN