Tags: python, pandas, multi-index

Change MultiIndex level: "ValueError: On level 0, code max >= length of level"


I am having a hard time changing some code that was adjusting the 'date' part of a MultiIndex to MonthEnd like this (relying on the fact that the 'date' part was at position 0):

offset = pd.offsets.MonthEnd()
df.index.set_levels(df.index.levels[0] + offset, level=0, inplace=True)

The inplace argument is marked as deprecated in pandas 1.2.1 (and for good reasons, I'm all in favor).

While refactoring the code, I figured I would also like to use a named level ('date'), rather than the int level (0), for easier readability and maintainability. So, I wrote:

level = 'date'
mi = df.index
df = df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))

That worked nicely, until it encountered a df that was a copy of another one with a subset of the MultiIndex.

Consider the following set up for a minimal example:

import numpy as np
import pandas as pd


def get_example_df(notbefore=None):
    np.random.seed(0)
    n = 3
    dates = pd.date_range('2000', freq='MS', periods=n)
    names = list('ab')
    df = pd.DataFrame(
        np.random.randint(10, size=n * len(names)),
        columns=['x'],
        index=pd.MultiIndex.from_product([dates, names],
                                         names=('date', 'name'))
    )
    if notbefore:
        dates = df.index.get_level_values('date')
        df = df.loc[dates >= notbefore]
    return df


level = 'date'
offset = pd.offsets.MonthEnd()

Without truncation, all is well:

>>> df = get_example_df()
>>> df
                 x
date       name   
2000-01-01 a     5
           b     0
2000-02-01 a     3
           b     3
2000-03-01 a     7
           b     9

# note:
>>> df.index.codes[0]
array([0, 0, 1, 1, 2, 2], dtype=int8)

>>> mi = df.index
>>> df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
                 x
date       name   
2000-01-31 a     5
           b     0
2000-02-29 a     3
           b     3
2000-03-31 a     7
           b     9

However, when the df has been truncated (because notbefore is not None), its MultiIndex still carries the levels of the original, and it goes rather badly:

>>> df = get_example_df(notbefore='2000-02-15')
>>> df
                 x
date       name   
2000-03-01 a     7
           b     9

>>> mi = df.index
>>> df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
...
ValueError: On level 0, code max (2) >= length of level (1). NOTE: this index is in an inconsistent state

It turns out that the problem is that the values in mi.codes[0] don't start at 0 when df is the truncated one:

>>> df.index.codes[0]
array([2, 2], dtype=int8)

So we have the unfortunate situation that:

>>> len(df.index.levels[0])
3

>>> len(df.index.get_level_values(level))
2

>>> len(df.index.unique(level))
1

and the only one that can be assigned (after adding the offset) back to the level is df.index.levels[0].
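To make the mismatch concrete, here is a minimal sketch of how the codes point into the levels. It rebuilds the same kind of truncated index directly (the names full, sub and rebuilt are mine, for illustration only) and shows that boolean selection keeps all three level values while the codes keep their old positions.

```python
import pandas as pd

dates = pd.date_range('2000', freq='MS', periods=3)
full = pd.MultiIndex.from_product([dates, list('ab')], names=('date', 'name'))

# Boolean selection subsets the codes but leaves the levels untouched.
sub = full[full.get_level_values('date') >= '2000-02-15']

levels = sub.levels[0]   # all 3 original dates, whether still used or not
codes = sub.codes[0]     # one entry per row, each a position into `levels`

# get_level_values() is, in effect, levels.take(codes)
rebuilt = levels.take(codes)

print(len(levels), codes.tolist(), len(sub.unique('date')))
```

So any replacement passed to set_levels() has to be as long as levels, because the codes (here 2 and 2) are left as they are.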

The only approach I have found for my new code that seems to work reliably is:

level_idx = df.index.names.index('date')
# level_idx is now 0
mi = df.index
mi = mi.set_levels(mi.levels[level_idx] + offset, level=level_idx)

And now:

>>> mi
MultiIndex([('2000-03-31', 'a'),
            ('2000-03-31', 'b')],
           names=['date', 'name'])

>>> mi.codes[0]
array([2, 2], dtype=int8)  # as before

That feels wrong. It would be good to have a .set_unique() that would be a robust counterpart to .unique(), even if .codes don't start at 0. The workaround is also inefficient (e.g. when there are far fewer .unique() values than values stored in the level).
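A rough sketch of what such a counterpart might look like, purely as an illustration: set_unique below is a hypothetical helper, not a pandas method. It assumes the level has no missing values and that new_values contains no duplicates, and it renumbers the codes so that they line up with the order in which .unique() reports the values.

```python
import pandas as pd

def set_unique(mi, new_values, level):
    """Hypothetical helper (not part of pandas): replace the values of one
    named level, where new_values is aligned with mi.unique(level)."""
    pos = mi.names.index(level)
    # factorize() renumbers the codes 0..k-1 in order of first appearance,
    # the same order in which mi.unique(level) lists the values.
    new_codes, _ = pd.factorize(mi.codes[pos])
    levels, codes = list(mi.levels), list(mi.codes)
    levels[pos] = pd.Index(new_values)
    codes[pos] = new_codes
    return pd.MultiIndex(levels=levels, codes=codes, names=mi.names)

dates = pd.date_range('2000', freq='MS', periods=3)
full = pd.MultiIndex.from_product([dates, list('ab')], names=('date', 'name'))
mi = full[full.get_level_values('date') >= '2000-02-15']  # codes[0] is [2, 2]

offset = pd.offsets.MonthEnd()
shifted = set_unique(mi, mi.unique('date') + offset, 'date')
print(shifted)
```

With this, the offset is only applied to the values actually in use, at the cost of one pass over the codes.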

Am I missing something?


Solution

  • Based on our discussion, I think you may want to remove the unused levels first, using MultiIndex.remove_unused_levels() (new in pandas 0.20.0):

    mi = df.index.remove_unused_levels()
    df.set_index(mi.set_levels(mi.unique(level) + offset, level=level))
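For completeness, a self-contained sketch of this approach on a truncated frame. The function name shift_level and the variable names are my own. One deliberate difference: the sketch takes the values from mi.levels rather than mi.unique(level). As far as I can tell, .unique() lists values in order of first appearance while the levels are kept sorted, so the two only coincide when the index happens to be sorted on that level.

```python
import pandas as pd

def shift_level(df, level, offset):
    """Sketch: add `offset` to one named level of df's MultiIndex."""
    mi = df.index.remove_unused_levels()  # levels now hold only values in use
    pos = mi.names.index(level)           # named level -> integer position
    # mi.levels[pos] lines up with the codes by construction.
    return df.set_index(mi.set_levels(mi.levels[pos] + offset, level=level))

dates = pd.date_range('2000', freq='MS', periods=3)
index = pd.MultiIndex.from_product([dates, list('ab')], names=('date', 'name'))
df = pd.DataFrame({'x': range(6)}, index=index)
truncated = df.loc[df.index.get_level_values('date') >= '2000-02-15']

shifted = shift_level(truncated, 'date', pd.offsets.MonthEnd())
print(shifted)
```

The caller still refers to the level by name; the integer position is only looked up inside the helper.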