This is a modified version of my previous question: I'm trying to run LIME on my quanteda
text model, which is trained on Trump & Clinton tweets data. I run it following the example given by Thomas Pedersen in his Understanding LIME and the useful SO answer provided by @Weihuang Wong:
library(readr) # provides read_csv()
library(dplyr)
library(stringr)
library(quanteda)
library(lime)
#data prep
tweet_csv <- read_csv("tweets.csv")
# creating corpus and dfm for train and test sets
get_matrix <- function(df){
corpus <- quanteda::corpus(df)
dfm <- quanteda::dfm(corpus, remove_url = TRUE, remove_punct = TRUE, remove = stopwords("english"))
}
set.seed(32984)
trainIndex <- sample.int(n = nrow(tweet_csv), size = floor(.8*nrow(tweet_csv)), replace = F)
train_dfm <- get_matrix(tweet_csv$text[trainIndex])
train_raw <- tweet_csv[, c("text", "tweet_num")][as.vector(trainIndex), ]
train_labels <- tweet_csv$author[as.vector(trainIndex)] == "realDonaldTrump"
test_dfm <- get_matrix(tweet_csv$text[-trainIndex])
test_raw <- tweet_csv[, c("text", "tweet_num")][-as.vector(trainIndex), ]
test_labels <- tweet_csv$author[-as.vector(trainIndex)] == "realDonaldTrump"
#### make sure that train & test sets have exactly same features
test_dfm <- dfm_select(test_dfm, train_dfm)
### Naive Bayes model using quanteda::textmodel_nb ####
nb_model <- quanteda::textmodel_nb(train_dfm, train_labels)
nb_preds <- predict(nb_model, test_dfm) #> 0.5
# select only correct predictions
predictions_tbl <- data.frame(predict_label = nb_preds$nb.predicted,
actual_label = test_labels,
tweet_name = rownames(nb_preds$posterior.prob)
) %>%
mutate(tweet_num =
as.integer(
str_trim(
str_replace_all(tweet_name, "text", ""))
))
correct_pred <- predictions_tbl %>%
filter(actual_label == predict_label)
# pick a sample of tweets for explainer
tweets_to_explain <- test_raw %>%
filter(tweet_num %in% correct_pred$tweet_num) %>%
head(4)
### set up correct model class and predict functions
class(nb_model)
model_type.textmodel_nb_fitted <- function(x, ...) {
return("classification")
}
# have to define a predict_model method for textmodel_nb_fitted so that lime can get predictions from the model
predict_model.textmodel_nb_fitted <- function(x, newdata, type, ...) {
X <- corpus(newdata)
X <- dfm_select(dfm(X), x$data$x)
res <- predict(x, newdata = X, ...)
switch(
type,
raw = data.frame(Response = res$nb.predicted, stringsAsFactors = FALSE),
prob = as.data.frame(res$posterior.prob, check.names = FALSE)
)
}
### run the explainer - no problems here
explainer <- lime(tweets_to_explain$text, # lime returns error on different features in explainer and explanations, even if I use the same dataset in both. Raised an issue on Github and asked a question on SO
model = nb_model,
preprocess = get_matrix)
But when I run explain() on the explainer...
corr_explanation <- lime::explain(tweets_to_explain$text,
explainer,
n_labels = 1,
n_features = 6,
cols = 2,
verbose = 0)
... I get the following error:
Error in UseMethod("corpus") : no applicable method for 'corpus' applied to an object of class "c('dfm', 'dgCMatrix', 'CsparseMatrix', 'dsparseMatrix', 'generalMatrix', 'dCsparseMatrix', 'dMatrix', 'sparseMatrix', 'compMatrix', 'Matrix', 'xMatrix', 'mMatrix', 'Mnumeric', 'replValueSp')"
It goes back to applying corpus() to newdata:
5.corpus(newdata)
4.predict_model.textmodel_nb_fitted(x = explainer$model, newdata = permutations_tokenized,
type = o_type)
3.predict_model(x = explainer$model, newdata = permutations_tokenized,
type = o_type)
2.explain.character(tweets_to_explain$text, explainer, n_labels = 1,
n_features = 6, cols = 2, verbose = 0)
1.lime::explain(tweets_to_explain$text, explainer, n_labels = 1,
n_features = 6, cols = 2, verbose = 0)
But I don't understand why this should cause any issues, as newdata is a text vector?
Thanks for any hints
corpus doesn't have to be run. Try redefining predict_model.textmodel_nb_fitted as follows, where the only modification is to add the dfm_select step:
predict_model.textmodel_nb_fitted <- function(x, newdata, type, ...) {
X <- dfm_select(dfm(newdata), x$data$x)
res <- predict(x, newdata = X, ...)
switch(
type,
raw = data.frame(Response = res$nb.predicted, stringsAsFactors = FALSE),
prob = as.data.frame(res$posterior.prob, check.names = FALSE)
)
}
As your traceback() output shows, corpus throws an error. To debug, I inserted print(str(newdata)) in the first line of the predict_model.textmodel_nb_fitted function. This shows that newdata is already a dfm object, so it can be passed directly into predict.textmodel_nb_fitted (after processing it with dfm_select).
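To illustrate why the error appears, here is a minimal base-R sketch of the same S3 dispatch failure. It does not use quanteda: to_corpus, toy_dfm and safe_input are hypothetical names standing in for corpus(), a dfm object, and a guarded preprocessing step.

```r
# Hypothetical generic standing in for quanteda::corpus();
# only a character method is defined.
to_corpus <- function(x, ...) UseMethod("to_corpus")
to_corpus.character <- function(x, ...) {
  structure(list(texts = x), class = "toy_corpus")
}

# Hypothetical matrix-like object standing in for a dfm.
toy_dfm <- structure(matrix(1:4, nrow = 2), class = "toy_dfm")

# Dispatch works on raw text ...
ok <- to_corpus(c("first tweet", "second tweet"))

# ... but fails on the already-converted object, because no method
# exists for its class. This mirrors corpus() being called on a dfm.
msg <- tryCatch(to_corpus(toy_dfm), error = function(e) conditionMessage(e))

# Checking the input type first avoids the failing call.
safe_input <- function(newdata) {
  if (is.character(newdata)) to_corpus(newdata) else newdata
}
```

Printing msg shows a "no applicable method" error of the same kind as the one in the question, while safe_input(toy_dfm) returns the object unchanged.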
In more recent versions of quanteda, textmodel_nb() returns an object of classes textmodel_nb, textmodel, and list. This would first require a corresponding method for model_type:
model_type.textmodel_nb <- function(x, ...) {
return("classification")
}
We then also have to write a textmodel_nb method for predict_model:
predict_model.textmodel_nb <- function(x, newdata, type, ...) {
X <- dfm_select(dfm(newdata), x$x)
res <- predict(x, newdata = X, ...)
switch(
type,
raw = data.frame(Response = res$nb.predicted, stringsAsFactors = FALSE),
prob = as.data.frame(res$posterior.prob, check.names = FALSE)
)
}
Notice that the second argument to dfm_select is different from that in predict_model.textmodel_nb_fitted (from the original version of the answer). This is because the structure of the x object (the output from textmodel_nb()) has changed.
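If the same script has to run against both layouts, one option is a small accessor that looks for the training features in either place. The sketch below is only an illustration: training_features is a hypothetical helper, and the two mock objects are plain lists that imitate the x$x and x$data$x layouts described above, not real quanteda models.

```r
# Hypothetical helper: return the training feature matrix from either layout.
# [[ ]] is used instead of $ to avoid partial name matching on lists.
training_features <- function(model) {
  if (!is.null(model[["x"]])) {
    model[["x"]]                 # newer layout: x$x
  } else if (!is.null(model[["data"]][["x"]])) {
    model[["data"]][["x"]]       # older layout: x$data$x
  } else {
    stop("could not find the training features in the model object")
  }
}

# Mock objects (plain lists) imitating the two layouts.
old_style <- list(data = list(x = matrix(1, nrow = 1, ncol = 1)))
new_style <- list(x = matrix(2, nrow = 1, ncol = 1))

old_features <- training_features(old_style)
new_features <- training_features(new_style)
```

Inside a predict_model method, the result of such a helper could be passed as the second argument to dfm_select in place of x$x or x$data$x.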