pythonpandasmulti-indexquerying

querying a multiindex pandas dataframe with slices


Assuming I have the following multiindex DF import pandas as pd

import numpy as np
import pandas as pd
input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(randint(1,10))+ '##' + str(randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df

I know that I can query a multiindex DF as follows:

# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]

basically with the help of pd.IndexSlice I can pass the values I want for every of the indexes. In the above case I want the resulting DF where the second index is 'pub' OR 'pre' and the 4th one is 'de'.

I am looking for the way to pass a range of values to the query. something like multiindex 3 beeing between 34567 and 45657. Assume those are integers.

pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?

EDIT 1: docId column index is of text type, probably its necessary to change it first to int


Solution

  • Turns out query is very powerful:

    df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
    

    Output:

                                              content
    input_id docType docId    secType sec_ids        
    12345    pre     34455667 de      x-y        2##9
                                      z-k        6##1
             pub     34455667 de      x-y        6##5
                                      z-k        9##8