I have 2 tables in consideration:
Table 1:
Country | Age | Marks |
---|---|---|
A | 25 | 7 |
B | 45 | 8 |
Table 2:
Age Band From | Age Band To | Country A | Country B |
---|---|---|---|
20 | 30 | ||
40 | 50 |
The output I want to have:
Age Band From | Age Band To | Country A | Country B |
---|---|---|---|
20 | 30 | 7 | NaN |
40 | 50 | NaN | 8 |
Here's what I tried:
for index,row in table2.iterrows():
table2.loc[index,'Country A'] = table1[(table1['Country']=='A')&
(table1['Age']>=row[0])&
(table1['Age']<=row[1])]['Marks'].values[0]
But this gives the following error:
index 0 is out of bounds for axis 0 with size 0
I think I've probably guessed where the error occurs: Whenever the compiler encounters a Age Band from Table2 for which a corresponding Age doesn't exist in Table1.
Any help on this problem is much appreciated! Thank You in advance...
There is no efficient builtin solution in pandas to perform this operation in a generic way. A cross-merge will work on small dataframes but will have poor to terrible efficiency on large datasets. In fact as it has a quadratic complexity, this will even crash python for anything above a few thousand rows.
A robust option is to pivot
then perform a conditional_join
using pyjanitor
:
# pip install pyjanitor
import janitor
out = (table2[['Age Band From', 'Age Band To']]
.conditional_join(table1.pivot(index='Age', columns='Country', values='Marks')
.add_prefix('Country ').reset_index(),
('Age Band From', 'Age', '<='),
('Age Band To', 'Age', '>='),
how='left'
)
)
Output:
Age Band From Age Band To Age Country A Country B
0 20 30 25 7.0 NaN
1 40 50 45 NaN 8.0
Using a length N
for both tables with random data (and deduplication):
Alternatively, if you only have a one to one mapping (i.e. no overlapping intervals, only one match in table1 per row of table2), you can use merge_asof
for a pure pandas solution:
tmp = (table1.pivot(index='Age', columns='Country', values='Marks')
.add_prefix('Country ').reset_index()
.sort_values(by='Age')
)
# merge on left boundary
out = (pd.merge_asof(table2[['Age Band From', 'Age Band To']].reset_index()
.sort_values(by='Age Band From'),
tmp, direction='forward',
left_on='Age Band From', right_on='Age')
.set_index('index').reindex(table2.index)
# hide based on right boundary
.where(lambda d: d.pop('Age').le(d['Age Band To']))
# restore Band data if needed
.combine_first(table2)
)
Output:
Age Band From Age Band To Country A Country B
0 20 30 7.0 NaN
1 40 50 NaN 8.0