pythonpandasscikit-learnlime

How to get original string values of encoded categorical columns in Lime graph


I am trying to work on local explainability using Lime graph. Before building the model, I encode some of the categorical variables.

Sample Data and code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.DataFrame({'customer_id' : np.arange(1,21),
                  'gender' : np.random.choice(['male','female'], 20),
                  'age' : np.random.randint(19,50, 20),
                  'salary' : np.random.randint(20000,95000, 20),
                  'purchased' : np.random.choice([0,1], 20, p = [.8,.2])})

Preprocessing:

df['gender'] = df['gender'].map({'female' : 0, 'male' : 1})

df['age'] = df['age'].map(lambda x : 'young' if x<=35 else 'middle aged')

df['age'] = df['age'].map({'young' : 0, 'middle aged' : 1})

bins = [0, df['salary'].quantile(q=.33),df['salary'].quantile(q=.66),df['salary'].quantile(q=1)+1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins = bins, labels=labels)

from sklearn import preprocessing
l_encoder={}
label_encoder = preprocessing.LabelEncoder()
df['salary']= label_encoder.fit_transform(df['salary'])
df

    customer_id gender  age salary  purchased
0   1           0       0   1       0
1   2           0       0   0       0
2   3           0       1   2       0
3   4           1       0   0       0
4   5           1       1   2       0
5   6           0       1   1       0
6   7           1       0   2       0
7   8           1       1   0       0
8   9           1       1   1       0
9   10          1       0   0       0
10  11          0       1   0       0
11  12          0       0   1       0
12  13          1       1   1       0
13  14          1       1   1       0
14  15          1       1   2       1
15  16          1       1   0       0
16  17          1       1   1       0
17  18          0       0   0       0
18  19          0       0   2       0
19  20          0       0   2       0


# input
x = df.iloc[:, :-1]
  
# output
y = df.iloc[:, 4]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)

Separating the customer_id column:

X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')

Fitting a logistic regression model:

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

Building a lime chart:

import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                feature_names=X_train.columns,
                                                  verbose=True, mode = 'classification')

exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)

exp.as_pyplot_figure()

enter image description here

The lime chart displays the encoded features/columns values. But I need the original value. For example, if the lime chart says 0, I need to display it as female. Could someone please let me know how fix it.


Solution

  • You can use:

    # Your direct mapping dictionary
    dmap = {'gender': {'female' : 0, 'male' : 1},
            'age': {'young' : 0, 'middle aged' : 1},
            'salary': {'low salary': 0, 'medium salary': 1, 'high salary': 2}}
    
    # Reverse mapping dictionary (not used hear)
    rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}
    
    # Categorical names, col0->gender, col1->age, col2->salary
    cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}
    
    
    # Now use
    explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                    feature_names=X_train.columns,
                                    categorical_features=[0, 1, 2],  # <- 3 first columns
                                    categorical_names=cmap,  # <- int to string
                                    verbose=True, mode = 'classification')
    
    exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
    
    exp.as_pyplot_figure()
    

    enter image description here

    Reat this tutorial