I have a multilevel columns in the pandas df
with the index as appid
as follows:
year |2016 2017 2018 2019 2016 2017 2018 2019
|ttl ttl ttl ttl tta tta tta tta
-----------------------------------------------------------------
appid |
75787 |NaN 227.0 470.0 426.0 NaN 25.0 23.0 21.0
146306 |NaN 858.0 226.0 NaN NaN 14.0 35.0 NaN
159479 |NaN NaN 0.0 NaN NaN NaN 3.5 NaN
163618 |NaN 0.0 650.0 100.0 NaN 12.0 14.6 123.0
215968 |23.0 0.0 NaN NaN 45.0 2.0 NaN NaN
I want to convert this df
in such a way that it can be sorted via the latest year entries present. E.g.
Year |P2Y PY LY P2Y PY LY
|ttl ttl ttl tta tta tta
----------------------------------------------------
appid |
75787 |227.0 470.0 426.0 25.0 23.0 21.0
146306 |NaN 858.0 226.0 NaN 14.0 35.0
159479 |NaN NaN 0.0 NaN NaN 3.5
163618 |0.0 650.0 100.0 12.0 14.6 123.0
215968 |NaN 23.0 0.0 NaN 45.0 2.0
You can try to work on the transposed dataset and use shift
:
df.T \
.apply(lambda x: x.shift(len(x) - x.index.get_loc(x.last_valid_index()) - 1)) \
.T \
.dropna(how='all', axis='columns'))
Explanations
Transpose the dataset using .T
Shift each column by the specified number of NaN
values at the end of the column
Use apply
on each columns
Find the last not NaN
values using last_valid_index
with get_loc
. For more detail on this step, see this Locate first and last non NaN values in a Pandas DataFrame
Compute the number of row shift from step 2.3 and len(x)
. Also substract 1 since the index from step 2.2 take the row index above.
Use shift
to shift the column
Eventually transpose dataset back as in step 1 using .T
Remove all NaN
column using dropna
and how='all', axis='columns'
Code + illustration
# Step 1
print(df.T)
# 75787 146306 159479 163618 215968
# year appid
# 2016 ttl NaN NaN NaN NaN 23.0
# 2017 ttl 227.0 858.0 NaN 0.0 0.0
# 2018 ttl 470.0 226.0 0.0 650.0 NaN
# 2019 ttl 426.0 NaN NaN 100.0 NaN
# 2016 tta NaN NaN NaN NaN 45.0
# 2017 tta 25.0 14.0 NaN 12.0 2.0
# 2018 tta 23.0 35.0 3.5 14.6 NaN
# 2019 tta 21.0 NaN NaN 123.0 NaN
# Step 2.2.1
print(df.T.apply(lambda x: x.last_valid_index()))
# 75787 (2019, tta)
# 146306 (2018, tta)
# 159479 (2018, tta)
# 163618 (2019, tta)
# 215968 (2017, tta)
# dtype: object
# Step 2.2.2
print(df.T.apply(lambda x: x.index.get_loc(x.last_valid_index())))
# 75787 7
# 146306 6
# 159479 6
# 163618 7
# 215968 5
# dtype: int64
# Step 2
print(df.T.apply(lambda x: x.shift(
len(x) - x.index.get_loc(x.last_valid_index()) - 1)))
# 75787 146306 159479 163618 215968
# year appid
# 2016 ttl NaN NaN NaN NaN NaN
# 2017 ttl 227.0 NaN NaN 0.0 NaN
# 2018 ttl 470.0 858.0 NaN 650.0 23.0
# 2019 ttl 426.0 226.0 0.0 100.0 0.0
# 2016 tta NaN NaN NaN NaN NaN
# 2017 tta 25.0 NaN NaN 12.0 NaN
# 2018 tta 23.0 14.0 NaN 14.6 45.0
# 2019 tta 21.0 35.0 3.5 123.0 2.0
# Step 3
print(df.T.apply(lambda x: x.shift(
len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T)
# year 2016 2017 2018 2019 2016 2017 2018 2019
# appid ttl ttl ttl ttl tta tta tta tta
# 75787 NaN 227.0 470.0 426.0 NaN 25.0 23.0 21.0
# 146306 NaN NaN 858.0 226.0 NaN NaN 14.0 35.0
# 159479 NaN NaN NaN 0.0 NaN NaN NaN 3.5
# 163618 NaN 0.0 650.0 100.0 NaN 12.0 14.6 123.0
# 215968 NaN NaN 23.0 0.0 NaN NaN 45.0 2.0
# Step 4
print(df.T.apply(lambda x: x.shift(
len(x) - x.index.get_loc(x.last_valid_index()) - 1)).T
.dropna(how='all', axis='columns'))
# year 2017 2018 2019 2017 2018 2019
# appid ttl ttl ttl tta tta tta
# 75787 227.0 470.0 426.0 25.0 23.0 21.0
# 146306 NaN 858.0 226.0 NaN 14.0 35.0
# 159479 NaN NaN 0.0 NaN NaN 3.5
# 163618 0.0 650.0 100.0 12.0 14.6 123.0
# 215968 NaN 23.0 0.0 NaN 45.0 2.0