pythonpandassortingmultilevel-analysis

Arrage dataframe based on the data presence in columns in multilevel dataframe


I have a multilevel columns in the pandas df with the index as appid as follows:

year   |2016    2017    2018    2019    2016  2017   2018   2019
       |ttl     ttl     ttl     ttl     tta   tta    tta    tta
-----------------------------------------------------------------
appid  |
75787  |NaN     227.0   470.0   426.0   NaN   25.0   23.0   21.0
146306 |NaN     858.0   226.0   NaN     NaN   14.0   35.0   NaN
159479 |NaN     NaN     0.0     NaN     NaN   NaN    3.5    NaN
163618 |NaN     0.0     650.0   100.0   NaN   12.0   14.6   123.0
215968 |23.0    0.0     NaN     NaN     45.0  2.0    NaN    NaN

I want to convert this df in such a way that it can be sorted via the latest year entries present. E.g.

Year   |P2Y      PY      LY    P2Y    PY    LY
       |ttl     ttl     ttl    tta    tta   tta

----------------------------------------------------
appid  |
75787  |227.0   470.0   426.0  25.0   23.0   21.0
146306 |NaN     858.0   226.0  NaN    14.0   35.0
159479 |NaN     NaN     0.0    NaN    NaN    3.5
163618 |0.0     650.0   100.0  12.0   14.6   123.0
215968 |NaN     23.0    0.0    NaN    45.0   2.0

Solution

  • You can try to work on the transposed dataset and use shift:

    df.T \
      .apply(lambda x: x.shift(len(x) - x.index.get_loc(x.last_valid_index()) - 1)) \
      .T \
      .dropna(how='all', axis='columns'))
    

    Explanations

    1. Transpose the dataset using .T

    2. Shift each column by the specified number of NaN values at the end of the column

      1. Use apply on each columns

      2. Find the last not NaN values using last_valid_index with get_loc. For more detail on this step, see this Locate first and last non NaN values in a Pandas DataFrame

      3. Compute the number of row shift from step 2.3 and len(x). Also substract 1 since the index from step 2.2 take the row index above.

      4. Use shift to shift the column

    3. Eventually transpose dataset back as in step 1 using .T

    4. Remove all NaN column using dropna and how='all', axis='columns'


    Code + illustration

    # Step 1
    print(df.T)
    #             75787   146306  159479  163618  215968
    # year appid
    # 2016 ttl       NaN     NaN     NaN     NaN    23.0
    # 2017 ttl     227.0   858.0     NaN     0.0     0.0
    # 2018 ttl     470.0   226.0     0.0   650.0     NaN
    # 2019 ttl     426.0     NaN     NaN   100.0     NaN
    # 2016 tta       NaN     NaN     NaN     NaN    45.0
    # 2017 tta      25.0    14.0     NaN    12.0     2.0
    # 2018 tta      23.0    35.0     3.5    14.6     NaN
    # 2019 tta      21.0     NaN     NaN   123.0     NaN
    
    
    # Step 2.2.1
    print(df.T.apply(lambda x: x.last_valid_index()))
    # 75787     (2019, tta)
    # 146306    (2018, tta)
    # 159479    (2018, tta)
    # 163618    (2019, tta)
    # 215968    (2017, tta)
    # dtype: object
    
    
    # Step 2.2.2
    print(df.T.apply(lambda x: x.index.get_loc(x.last_valid_index())))
    # 75787     7
    # 146306    6
    # 159479    6
    # 163618    7
    # 215968    5
    # dtype: int64
    
    
    # Step 2
    print(df.T.apply(lambda x: x.shift(
        len(x) - x.index.get_loc(x.last_valid_index()) - 1)))
    #             75787   146306  159479  163618  215968
    # year appid
    # 2016 ttl       NaN     NaN     NaN     NaN     NaN
    # 2017 ttl     227.0     NaN     NaN     0.0     NaN
    # 2018 ttl     470.0   858.0     NaN   650.0    23.0
    # 2019 ttl     426.0   226.0     0.0   100.0     0.0
    # 2016 tta       NaN     NaN     NaN     NaN     NaN
    # 2017 tta      25.0     NaN     NaN    12.0     NaN
    # 2018 tta      23.0    14.0     NaN    14.6    45.0
    # 2019 tta      21.0    35.0     3.5   123.0     2.0
    
    
    # Step 3
    print(df.T.apply(lambda x: x.shift(
        len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T)
    # year   2016   2017   2018   2019 2016  2017  2018   2019
    # appid   ttl    ttl    ttl    ttl  tta   tta   tta    tta
    # 75787   NaN  227.0  470.0  426.0  NaN  25.0  23.0   21.0
    # 146306  NaN    NaN  858.0  226.0  NaN   NaN  14.0   35.0
    # 159479  NaN    NaN    NaN    0.0  NaN   NaN   NaN    3.5
    # 163618  NaN    0.0  650.0  100.0  NaN  12.0  14.6  123.0
    # 215968  NaN    NaN   23.0    0.0  NaN   NaN  45.0    2.0
    
    
    
    # Step 4
    print(df.T.apply(lambda x: x.shift(
        len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T
        .dropna(how='all', axis='columns'))
    
    # year     2017   2018   2019  2017  2018   2019
    # appid     ttl    ttl    ttl   tta   tta    tta
    # 75787   227.0  470.0  426.0  25.0  23.0   21.0
    # 146306    NaN  858.0  226.0   NaN  14.0   35.0
    # 159479    NaN    NaN    0.0   NaN   NaN    3.5
    # 163618    0.0  650.0  100.0  12.0  14.6  123.0
    # 215968    NaN   23.0    0.0   NaN  45.0    2.0