I am having a hard time changing some code that was adjusting the 'date' part of a MultiIndex to MonthEnd like this (relying on the fact that the 'date' part was at position 0):
offset = pd.offsets.MonthEnd()
df.index.set_levels(df.index.levels[0] + offset, level=0, inplace=True)
The inplace argument is marked as deprecated in pandas=1.2.1 (and for good reasons, I'm all in favor).
While refactoring the code, I figured I would also like to use a named level ('date') rather than the int level (0), for easier readability and maintainability. So, I wrote:
level = 'date'
mi = df.index
df = df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
That worked nicely, until it encountered a df that was a copy of another one with a subset of the MultiIndex.
Consider the following set up for a minimal example:
import numpy as np
import pandas as pd

def get_example_df(notbefore=None):
    np.random.seed(0)
    n = 3
    dates = pd.date_range('2000', freq='MS', periods=n)
    names = list('ab')
    df = pd.DataFrame(
        np.random.randint(10, size=n * len(names)),
        columns=['x'],
        index=pd.MultiIndex.from_product([dates, names],
                                         names=('date', 'name'))
    )
    if notbefore:
        # keep only the rows whose 'date' is on or after notbefore
        dates = df.index.get_level_values('date')
        df = df.loc[dates >= notbefore]
    return df

level = 'date'
offset = pd.offsets.MonthEnd()
Without truncation, all is well:
>>> df = get_example_df()
>>> df
                 x
date       name
2000-01-01 a     5
           b     0
2000-02-01 a     3
           b     3
2000-03-01 a     7
           b     9
# note:
>>> df.index.codes[0]
array([0, 0, 1, 1, 2, 2], dtype=int8)
>>> mi = df.index
>>> df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
                 x
date       name
2000-01-31 a     5
           b     0
2000-02-29 a     3
           b     3
2000-03-31 a     7
           b     9
However, when the MultiIndex is a view (because notbefore is not None), it goes rather badly:
>>> df = get_example_df(notbefore='2000-02-15')
>>> df
                 x
date       name
2000-03-01 a     7
           b     9
>>> mi = df.index
>>> df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
...
ValueError: On level 0, code max (2) >= length of level (1). NOTE: this index is in an inconsistent state
It turns out the problem is that mi.codes[0] doesn't start at 0 when the df is the truncated one:
>>> df.index.codes[0]
array([2, 2], dtype=int8)
So we have the unfortunate situation that:
>>> len(df.index.levels[0])
3
>>> len(df.index.get_level_values(level))
2
>>> len(df.index.unique(level))
1
and the only one that can be assigned (after adding the offset) back to the level is df.index.levels[0].
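To make the three lengths above concrete, here is a minimal sketch that rebuilds the truncated frame and records what each accessor reports. It assumes the same filtering via df.loc with a boolean mask as in get_example_df; the variable names (sub, n_levels, and so on) are only for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000", freq="MS", periods=3)
df = pd.DataFrame(
    {"x": np.arange(6)},
    index=pd.MultiIndex.from_product([dates, list("ab")], names=("date", "name")),
)

# Boolean filtering drops rows but leaves the stored levels untouched.
sub = df.loc[df.index.get_level_values("date") >= "2000-02-15"].index

n_levels = len(sub.levels[0])                # stored level values, used or not
n_rows = len(sub.get_level_values("date"))   # one label per remaining row
n_unique = len(sub.unique("date"))           # distinct labels actually present
codes = list(sub.codes[0])                   # positions into sub.levels[0]
```

Since the codes are positions into the stored levels, any replacement passed to set_levels has to be as long as the stored levels, not as long as the set of values still in use.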
The only thing I can figure for my new code that seems to reliably work is:
level_idx = df.index.names.index('date')
# level_idx is now 0
mi = df.index
mi = mi.set_levels(mi.levels[level_idx] + offset, level=level_idx)
And now:
>>> mi
MultiIndex([('2000-03-31', 'a'),
            ('2000-03-31', 'b')],
           names=['date', 'name'])
>>> mi.codes[0]
array([2, 2], dtype=int8) # as before
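The workaround above could be wrapped in a small function so that callers only deal with the level name. This is just a sketch: shift_level is a hypothetical helper name, not a pandas API, and it assumes the level holds datetimes so that adding the offset is valid.

```python
import pandas as pd

def shift_level(mi, level, offset):
    """Hypothetical helper: add `offset` to every stored value of one level."""
    pos = mi.names.index(level)  # resolve the level name to its position
    # Operate on the stored levels so the existing codes stay valid.
    return mi.set_levels(mi.levels[pos] + offset, level=level)

dates = pd.date_range("2000", freq="MS", periods=3)
full = pd.MultiIndex.from_product([dates, list("ab")], names=("date", "name"))
truncated = full[full.get_level_values("date") >= "2000-02-15"]

shifted = shift_level(truncated, "date", pd.offsets.MonthEnd())
```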
That feels wrong. It would be good to have a .set_unique() that would be a robust counterpart to .unique(), even if .codes don't start at 0. It is also inefficient (e.g. when there are far fewer .unique() values than the full length of the MultiIndex).
Am I missing something?
Based on our discussion, I think you may want to remove the unused levels first, using MultiIndex.remove_unused_levels() (new in pandas 0.20.0):
mi = df1.index.remove_unused_levels()
df1.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
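For completeness, here is a self-contained sketch of that approach on a truncated frame, assuming df1 stands for the filtered DataFrame from the question. One deliberate difference: it reads the values from mi.levels rather than mi.unique(level). As far as I can tell, unique() returns values in order of appearance, which need not match the order of the stored levels when the index is unsorted, so treat that substitution as my assumption rather than part of the original suggestion.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000", freq="MS", periods=3)
df = pd.DataFrame(
    {"x": np.arange(6)},
    index=pd.MultiIndex.from_product([dates, list("ab")], names=("date", "name")),
)
df1 = df.loc[df.index.get_level_values("date") >= "2000-02-15"]

level = "date"
offset = pd.offsets.MonthEnd()

# Drop level values that no row refers to; codes are renumbered to match.
mi = df1.index.remove_unused_levels()
pos = mi.names.index(level)
shifted = df1.set_index(mi.set_levels(mi.levels[pos] + offset, level=level))
```

After remove_unused_levels() the 'date' level holds only the one value still in use, so the offset is applied to a single element instead of all three original dates.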