pythonpandasdataframeinterpolationfillna

Pandas data frame use interpolate() partitioning with specific columns


I have the following Pandas data frame (called df).

+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John   | Dogs   | 2000 | 2      |
| John   | Dogs   | 2001 | 2      |
| John   | Dogs   | 2002 | 2      |
| John   | Dogs   | 2003 | 2      |
| John   | Dogs   | 2004 | 2      |
| John   | Dogs   | 2005 | 2      |
| John   | Cats   | 2000 | 1      |
| John   | Cats   | 2001 | NaN    |
| John   | Cats   | 2002 | NaN    |
| John   | Cats   | 2003 | 4      |
| John   | Cats   | 2004 | 5      |
| John   | Cats   | 2005 | 6      |
| Peter  | Dogs   | 2000 | NaN    |
| Peter  | Dogs   | 2001 | 1      |
| Peter  | Dogs   | 2002 | NaN    |
| Peter  | Dogs   | 2003 | 5      |
| Peter  | Dogs   | 2004 | 5      |
| Peter  | Dogs   | 2005 | 5      |
| Peter  | Cats   | 2000 | NaN    |
| Peter  | Cats   | 2001 | 4      |
| Peter  | Cats   | 2002 | 4      |
| Peter  | Cats   | 2003 | 4      |
| Peter  | Cats   | 2004 | 4      |
| Peter  | Cats   | 2005 | 4      |
+--------+--------+------+--------+

My target is to get the following, which means using the interpolate method to fill the NaN values, but based on the other column value. In other words, it should

  1. partition the df using the Person and Animal columns
  2. order by Year (asc)
  3. apply the interpolate method

.

+--------+--------+------+--------+
| Person | Animal | Year | Number |
+--------+--------+------+--------+
| John   | Dogs   | 2000 | 2      |
| John   | Dogs   | 2001 | 2      |
| John   | Dogs   | 2002 | 2      |
| John   | Dogs   | 2003 | 2      |
| John   | Dogs   | 2004 | 2      |
| John   | Dogs   | 2005 | 2      |
| John   | Cats   | 2000 | 1      |
| John   | Cats   | 2001 | 2      |
| John   | Cats   | 2002 | 3      |
| John   | Cats   | 2003 | 4      |
| John   | Cats   | 2004 | 5      |
| John   | Cats   | 2005 | 6      |
| Peter  | Dogs   | 2000 | NaN    |
| Peter  | Dogs   | 2001 | 1      |
| Peter  | Dogs   | 2002 | 3      |
| Peter  | Dogs   | 2003 | 5      |
| Peter  | Dogs   | 2004 | 5      |
| Peter  | Dogs   | 2005 | 5      |
| Peter  | Cats   | 2000 | NaN    |
| Peter  | Cats   | 2001 | 4      |
| Peter  | Cats   | 2002 | 4      |
| Peter  | Cats   | 2003 | 4      |
| Peter  | Cats   | 2004 | 4      |
| Peter  | Cats   | 2005 | 4      |
+--------+--------+------+--------+

What I have done

I can filter for each Person and each Animal and then apply the interpolate methods. Finally, merge all together, but this sounds dull and long if you have many columns.


Solution

  • You can try:

    df['Number'] = (df.sort_values('Year', ascending=True)
                      .groupby(['Person', 'Animal'])['Number']
                      .transform(lambda x: x.interpolate()))
    print(df)
    
    # Output
       Person Animal  Year  Number
    0    John   Dogs  2000     2.0
    1    John   Dogs  2001     2.0
    2    John   Dogs  2002     2.0
    3    John   Dogs  2003     2.0
    4    John   Dogs  2004     2.0
    5    John   Dogs  2005     2.0
    6    John   Cats  2000     1.0
    7    John   Cats  2001     2.0  # interpolate
    8    John   Cats  2002     3.0  # interpolate
    9    John   Cats  2003     4.0
    10   John   Cats  2004     5.0
    11   John   Cats  2005     6.0
    12  Peter   Dogs  2000     NaN
    13  Peter   Dogs  2001     1.0
    14  Peter   Dogs  2002     3.0
    15  Peter   Dogs  2003     5.0
    16  Peter   Dogs  2004     5.0
    17  Peter   Dogs  2005     5.0
    18  Peter   Cats  2000     NaN
    19  Peter   Cats  2001     4.0
    20  Peter   Cats  2002     4.0
    21  Peter   Cats  2003     4.0
    22  Peter   Cats  2004     4.0
    23  Peter   Cats  2005     4.0
    

    For multiple columns, just use the same operation:

    cols = ['Number']  # list of columns
    df[cols] = (df.sort_values('Year', ascending=True)
                  .groupby(['Person', 'Animal'])[cols]
                  .transform(lambda x: x.interpolate()))