I was playing around with LDA in the text2vec package and was confused why the results of fit_transform and transform were different when using the same data.
The documentation states that transform applies the learned model to new data, but the result is a lot different from the one produced by fit_transform.
library(stringr)
library(text2vec)
library(dplyr)

# movie_review ships with text2vec, so load it after the package is attached
data("movie_review")

# Lowercase and tokenize the first 4000 reviews
tokens = movie_review$review[1:4000] %>%
  tolower %>%
  word_tokenizer
it = itoken(tokens, ids = movie_review$id[1:4000], progressbar = FALSE)

# Build the vocabulary, dropping rare and very common terms
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")

lda_model = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)

# Fit the model and return the document-topic distribution
set.seed(123)
doc_topic_distr =
  lda_model$fit_transform(x = dtm, n_iter = 1000,
                          convergence_tol = 0.001, n_check_convergence = 25,
                          progressbar = FALSE)

# Apply the already fitted model to the same data
set.seed(123)
new_doc_topic_dist =
  lda_model$transform(x = dtm, n_iter = 1000,
                      convergence_tol = 0.001, n_check_convergence = 25,
                      progressbar = FALSE)

head(doc_topic_distr)
head(new_doc_topic_dist)
I expected doc_topic_distr and new_doc_topic_dist to be the same, but they are quite different.
Good question! Indeed, there is an issue with the CRAN version (it is mostly fixed in the dev version on GitHub). The issue is the following:
fit_transform: we learn both the document-topic distribution and the word-topic distribution. Once converged, we save word-topic inside the model and return document-topic as the result.
transform: we use the fixed word-topic distribution and only infer document-topic. There is no guarantee that the inferred document-topic distribution will be the same as during fit_transform (but it should be close enough).
What we've changed in the dev version: we run fit_transform and transform in such a way that each method gives almost the same document-topic distribution. (There are a couple of additional parameter tweaks to make sure they are exactly the same; see the documentation for the development version.)
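To put a number on "close enough", something like the following base R sketch could help. Note that compare_doc_topics is a hypothetical helper written for this answer, not part of text2vec, and the two small matrices are made up purely for illustration. It assumes both inputs are dense matrices with documents in rows and topics in columns.

```r
# Hypothetical helper (not part of text2vec): quantify how far apart two
# document-topic matrices are. Rows are documents, columns are topics.
compare_doc_topics = function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  # Largest absolute per-topic difference for each document
  max_abs_diff = apply(abs(a - b), 1, max)
  # Does the most probable topic agree for each document?
  same_top_topic = max.col(a, ties.method = "first") ==
    max.col(b, ties.method = "first")
  list(
    max_abs_diff = max_abs_diff,
    mean_max_abs_diff = mean(max_abs_diff),
    top_topic_agreement = mean(same_top_topic)
  )
}

# Toy matrices, made up for illustration only (2 documents, 3 topics)
a = matrix(c(0.7, 0.2, 0.1,
             0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
b = matrix(c(0.6, 0.3, 0.1,
             0.2, 0.3, 0.5), nrow = 2, byrow = TRUE)
res = compare_doc_topics(a, b)
str(res)
```

With the objects from the question, the call would be compare_doc_topics(doc_topic_distr, new_doc_topic_dist). A high top_topic_agreement together with a small mean_max_abs_diff would suggest the two methods agree in practice even if the rows are not identical.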