I have the following Pandas data frame (called df
).
+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John | Dogs | 2000 | 2 |
| John | Dogs | 2001 | 2 |
| John | Dogs | 2002 | 2 |
| John | Dogs | 2003 | 2 |
| John | Dogs | 2004 | 2 |
| John | Dogs | 2005 | 2 |
| John | Cats | 2000 | 1 |
| John | Cats | 2001 | NaN |
| John | Cats | 2002 | NaN |
| John | Cats | 2003 | 4 |
| John | Cats | 2004 | 5 |
| John | Cats | 2005 | 6 |
| Peter | Dogs | 2000 | NaN |
| Peter | Dogs | 2001 | 1 |
| Peter | Dogs | 2002 | NaN |
| Peter | Dogs | 2003 | 5 |
| Peter | Dogs | 2004 | 5 |
| Peter | Dogs | 2005 | 5 |
| Peter | Cats | 2000 | NaN |
| Peter | Cats | 2001 | 4 |
| Peter | Cats | 2002 | 4 |
| Peter | Cats | 2003 | 4 |
| Peter | Cats | 2004 | 4 |
| Peter | Cats | 2005 | 4 |
+--------+--------+------+--------+
My target is to get the following, which means using the interpolate method to fill the NaN
values, but based on the other column value. In other words, it should
Person
and Animal
columnsYear
(asc).
+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John | Dogs | 2000 | 2 |
| John | Dogs | 2001 | 2 |
| John | Dogs | 2002 | 2 |
| John | Dogs | 2003 | 2 |
| John | Dogs | 2004 | 2 |
| John | Dogs | 2005 | 2 |
| John | Cats | 2000 | 1 |
| John | Cats | 2001 | 2 |
| John | Cats | 2002 | 3 |
| John | Cats | 2003 | 4 |
| John | Cats | 2004 | 5 |
| John | Cats | 2005 | 6 |
| Peter | Dogs | 2000 | NaN |
| Peter | Dogs | 2001 | 1 |
| Peter | Dogs | 2002 | 3 |
| Peter | Dogs | 2003 | 5 |
| Peter | Dogs | 2004 | 5 |
| Peter | Dogs | 2005 | 5 |
| Peter | Cats | 2000 | NaN |
| Peter | Cats | 2001 | 4 |
| Peter | Cats | 2002 | 4 |
| Peter | Cats | 2003 | 4 |
| Peter | Cats | 2004 | 4 |
| Peter | Cats | 2005 | 4 |
+--------+--------+------+--------+
What I have done
I can filter for each Person and each Animal and then apply the interpolate methods. Finally, merge all together, but this sounds dull and long if you have many columns.
You can try:
df['Number'] = (df.sort_values('Year', ascending=True)
.groupby(['Person', 'Animal'])['Number']
.transform(lambda x: x.interpolate()))
print(df)
# Output
Person Animal Year Number
0 John Dogs 2000 2.0
1 John Dogs 2001 2.0
2 John Dogs 2002 2.0
3 John Dogs 2003 2.0
4 John Dogs 2004 2.0
5 John Dogs 2005 2.0
6 John Cats 2000 1.0
7 John Cats 2001 2.0 # interpolate
8 John Cats 2002 3.0 # interpolate
9 John Cats 2003 4.0
10 John Cats 2004 5.0
11 John Cats 2005 6.0
12 Peter Dogs 2000 NaN
13 Peter Dogs 2001 1.0
14 Peter Dogs 2002 3.0
15 Peter Dogs 2003 5.0
16 Peter Dogs 2004 5.0
17 Peter Dogs 2005 5.0
18 Peter Cats 2000 NaN
19 Peter Cats 2001 4.0
20 Peter Cats 2002 4.0
21 Peter Cats 2003 4.0
22 Peter Cats 2004 4.0
23 Peter Cats 2005 4.0
For multiple columns, just use the same operation:
cols = ['Number'] # list of columns
df[cols] = (df.sort_values('Year', ascending=True)
.groupby(['Person', 'Animal'])[cols]
.transform(lambda x: x.interpolate()))