pythonpandasmulti-index

How to remove <indexing past lexsort depth may impact performance?">


I've a dataframe with a non-unique MultiIndex:

           A    B
L1  L2           
7.0 7.0 -0.4 -0.1
8.0 5.0 -2.1  1.6
5.0 8.0 -1.8 -0.8
7.0 7.0  0.5 -1.2
    NaN -1.1 -0.9
5.0 8.0  0.6  2.3

I want to select some rows using a tuple of values:

data = df.loc[(7, 7), :]

With no surprise a warning is triggered:

PerformanceWarning: indexing past lexsort depth may impact performance.

I'm trying to understand what in the current index causes this warning. I've read many answers here, some are related to old versions of pandas, other helped. From what I've read the warning is caused by two properties:

So I'm processing the dataframe index with this function designed from the answers found on this stack:

def adjust_index(df):
    df = df.sort_index() # sort index
    levels = list(range(len(df.index.levels)))
    df_idx = df.groupby(level=levels).cumcount() # unique index
    df_adj = df.set_index(df_idx, append=True) # change index
    df_adj = df_adj.reset_index(level=-1, drop=True) # drop sorting level
    return df_adj

This doesn't remove the warning. Can you explain what is wrong, useless or missing?

The rest of the code:

import pandas as pd
from numpy import nan, random as npr
npr.seed(2)
    
# Dataframe with unsorted MultiIndex
def create_df():
    n_rows = 6
    data = npr.randn(n_rows, 2).round(1)
    choices = [8, 7, 5, 7, 8, nan]
    columns = ['A', 'B']
    levels = ['L1', 'L2']
    tuples = list(zip(npr.choice(choices, n_rows), npr.choice(choices, n_rows)))
    index = pd.MultiIndex.from_tuples(tuples, names=levels)
    df = pd.DataFrame(data, index=index, columns=columns)
    return df

df = create_df()
df = adjust_index(df)
data = df.loc[(7, 7), :] # <-- triggers warning

Solution

  • I got rid of the warning by sorting the index and putting the NaN values first:

    df.sort_index(inplace=True, na_position="first")
    data = df.loc[(7, 7), :]
    
    print(data)
    

    Prints:

               A    B
    L1  L2           
    7.0 7.0 -0.4 -0.1
        7.0  0.5 -1.2
    

    I think the issue is with the NaN value you have in index. Pandas has special index.codes for each unique value in index and NaN is encoded as -1. So to have sorted index you have to have this -1 value on first position, hence na_position="first"