pandasdataframedata-manipulationdata-filtering

How to populate a pandas df based on conditions from another df?


I have 2 tables in consideration:

Table 1:

Country Age Marks
A 25 7
B 45 8

Table 2:

Age Band From Age Band To Country A Country B
20 30
40 50

The output I want to have:

Age Band From Age Band To Country A Country B
20 30 7 NaN
40 50 NaN 8

Here's what I tried:

for index,row in table2.iterrows():
    table2.loc[index,'Country A'] = table1[(table1['Country']=='A')&
                                           (table1['Age']>=row[0])&
                                           (table1['Age']<=row[1])]['Marks'].values[0]

But this gives the following error: index 0 is out of bounds for axis 0 with size 0

I think I've probably guessed where the error occurs: Whenever the compiler encounters a Age Band from Table2 for which a corresponding Age doesn't exist in Table1.

Any help on this problem is much appreciated! Thank You in advance...


Solution

  • There is no efficient builtin solution in pandas to perform this operation in a generic way. A cross-merge will work on small dataframes but will have poor to terrible efficiency on large datasets. In fact as it has a quadratic complexity, this will even crash python for anything above a few thousand rows.

    A robust option is to pivot then perform a conditional_join using pyjanitor:

    # pip install pyjanitor
    import janitor
    
    out = (table2[['Age Band From', 'Age Band To']]
           .conditional_join(table1.pivot(index='Age', columns='Country', values='Marks')
                                   .add_prefix('Country ').reset_index(),
                             ('Age Band From', 'Age', '<='),
                             ('Age Band To', 'Age', '>='),
                             how='left'
                            )
          )
    

    Output:

       Age Band From  Age Band To  Age  Country A  Country B
    0             20           30   25        7.0        NaN
    1             40           50   45        NaN        8.0
    

    Comparison of efficiencies

    Using a length N for both tables with random data (and deduplication):

    enter image description here

    Alternatively, if you only have a one to one mapping (i.e. no overlapping intervals, only one match in table1 per row of table2), you can use merge_asof for a pure pandas solution:

    tmp = (table1.pivot(index='Age', columns='Country', values='Marks')
                 .add_prefix('Country ').reset_index()
                 .sort_values(by='Age')
          )
    
             # merge on left boundary
    out = (pd.merge_asof(table2[['Age Band From', 'Age Band To']].reset_index()
                         .sort_values(by='Age Band From'),
                         tmp, direction='forward',
                         left_on='Age Band From', right_on='Age')
             .set_index('index').reindex(table2.index)
             # hide based on right boundary
             .where(lambda d: d.pop('Age').le(d['Age Band To']))
             # restore Band data if needed
             .combine_first(table2)
          )
    

    Output:

       Age Band From  Age Band To  Country A  Country B
    0             20           30        7.0        NaN
    1             40           50        NaN        8.0