Situation: I am visualizing the results of a Hugging Face transformers machine learning model I have been building, using the LIME package and following this tutorial.
Complication: My code is set up and runs well until I create the LIME explainer object, at which point I get a memory error.
Question: What am I doing wrong? Why am I running into a memory error?
Code: Here is my code (you should be able to copy-paste it into Google Colab and run the whole thing):
########################## LOAD PACKAGES ######################
# Install new packages in our environment
!pip install lime
!pip install wget
!pip install transformers
# Import general libraries
import sklearn
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import numpy as np
import pandas as pd
# Import libraries specific to this notebook
import lime
import wget
import os
from transformers import FeatureExtractionPipeline, BertModel, BertTokenizer, BertConfig
from lime.lime_text import LimeTextExplainer
# Let the notebook know to plot inline
%matplotlib inline
########################## LOAD DATA ##########################
# Get URL
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'
# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip
# Load the dataset into a pandas dataframe.
df_cola = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t',
                      header=None, names=['sentence_source', 'label',
                                          'label_notes', 'sentence'])
# Only look at the first 50 observations for debugging
df_cola = df_cola.head(50)
###################### TRAIN TEST SPLIT ######################
# Apply the train test split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    df_cola.sentence, df_cola.label, test_size=0.2, random_state=42
)
###################### CREATE LIME CLASSIFIER ######################
# Create a function to extract vectors from a single sentence
def vector_extractor(sentence):
    # Create a basic BERT model, config and tokenizer for the pipeline
    configuration = BertConfig()
    configuration.max_len = 64
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                              do_lower_case=True,
                                              max_length=64,
                                              pad_to_max_length=True)
    model = BertModel.from_pretrained('bert-base-uncased', config=configuration)
    # Create the pipeline
    vector_extractor = FeatureExtractionPipeline(model=model,
                                                 tokenizer=tokenizer,
                                                 device=0)
    # The pipeline gives us all tokens in the final layer - we want the CLS token
    vector = vector_extractor(sentence)
    vector = vector[0][0]
    # Return the vector
    return vector
# Adjust the format of our sentences (from pandas series to python list)
x_train = x_train.values.tolist()
x_test = x_test.values.tolist()
# First we vectorize our train features for the classifier
x_train_vectorized = [vector_extractor(x) for x in x_train]
# Create and fit the random forest classifier
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
rf.fit(x_train_vectorized, y_train)
# Define the lime_classifier function
def lime_classifier(sentences):
    # Turn all the sentences into vectors
    vectors = [vector_extractor(x) for x in sentences]
    # Get predictions for all
    predictions = rf.predict_proba(vectors)
    # Return the probabilities as a 2D-array
    return predictions
########################### APPLY LIME ##########################
# Create the general explainer object
explainer = LimeTextExplainer()
# "Fit" the explainer object to a specific observation
exp = explainer.explain_instance(x_test[1],
                                 lime_classifier,
                                 num_features=6)
I ended up solving this by re-implementing my approach along the lines of this GitHub post: https://github.com/marcotcr/lime/issues/409
My code is now very different from the above, so if you're running into similar issues, it probably makes sense to look to the GitHub post for guidance.
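For readers who land here with the same error, one pattern in the code above seems worth checking, though it is not confirmed as the cause: vector_extractor builds a new tokenizer, model and pipeline on every call, and explain_instance hands the classifier function a long list of perturbed sentences (its num_samples argument defaults to 5000). The sketch below is not the approach from the GitHub post. It is a minimal illustration, assuming the pipeline can be built once and reused. The names get_extractor, embed_sentence and make_lime_classifier, and the batch_size parameter, are hypothetical.

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=1)
def get_extractor():
    """Build the feature-extraction pipeline once and reuse it afterwards."""
    # Imported lazily so the helpers below can be exercised without transformers.
    from transformers import pipeline
    return pipeline("feature-extraction", model="bert-base-uncased")


def embed_sentence(sentence):
    """Return the final-layer vector of the first ([CLS]) token."""
    return get_extractor()(sentence)[0][0]


def make_lime_classifier(embed, predict_proba, batch_size=32):
    """Wrap an embedder and a fitted model's predict_proba for LIME.

    batch_size is a hypothetical knob; it only bounds how many embedding
    vectors are held in memory at a time.
    """
    def classifier(sentences):
        chunks = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            vectors = np.array([embed(s) for s in batch])
            chunks.append(predict_proba(vectors))
        # LIME expects one row of class probabilities per input sentence
        return np.vstack(chunks)
    return classifier
```

With the objects from the question, this would be wired up as `lime_classifier = make_lime_classifier(embed_sentence, rf.predict_proba)`. If memory is still tight, passing a smaller `num_samples` to `explain_instance` (for example `num_samples=500`) reduces the number of perturbed sentences, at the cost of a noisier explanation.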