r nlp lda text2vec

Why do fit_transform and transform produce different results?


I was playing around with LDA in the text2vec package and was confused about why fit_transform and transform give different results on the same data.

The documentation states that transform applies the learned model to new data, but its result is very different from the one produced by fit_transform.

library(stringr)
library(text2vec)
library(dplyr)

# movie_review ships with text2vec, so load the package first
data("movie_review")

tokens = movie_review$review[1:4000] %>% 
  tolower %>% 
  word_tokenizer

it = itoken(tokens, ids = movie_review$id[1:4000], progressbar = FALSE)

v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)

vectorizer = vocab_vectorizer(v)

dtm = create_dtm(it, vectorizer, type = "dgTMatrix")

lda_model = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)

set.seed(123)

doc_topic_distr = 
  lda_model$fit_transform(x = dtm, n_iter = 1000, 
                          convergence_tol = 0.001, n_check_convergence = 25, 
                          progressbar = FALSE)

set.seed(123)

new_doc_topic_dist = 
  lda_model$transform(x = dtm, n_iter = 1000, 
                          convergence_tol = 0.001, n_check_convergence = 25, 
                          progressbar = FALSE)

head(doc_topic_distr)
head(new_doc_topic_dist)

I expected doc_topic_distr and new_doc_topic_dist to be the same, but they are quite different.


Solution

  • Good question! There is indeed an issue with the CRAN version (it is mostly fixed in the dev version on GitHub). The issue is the following:

    1. During fit_transform we learn both the document-topic distribution and the word-topic distribution. Once converged, we save the word-topic distribution inside the model and return the document-topic distribution as the result.
    2. During transform we keep the word-topic distribution fixed and only infer the document-topic distribution. There is no guarantee that the inferred document-topic distribution will be the same as the one from fit_transform (but it should be close enough).

    What we've changed in the dev version: fit_transform and transform are now run so that both methods give almost the same document-topic distribution. (A couple of additional parameter tweaks are needed to make them exactly the same; see the documentation for the development version.)
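
To judge whether the two results are "close enough" in practice, it may help to compare them numerically rather than by eye with head(). The sketch below is only an illustration: compare_doc_topic is a hypothetical helper, not part of text2vec. It assumes both inputs are plain numeric matrices with documents in rows and topics in columns, and that each row sums to 1. The toy matrices a and b stand in for doc_topic_distr and new_doc_topic_dist.

```r
# Hypothetical helper (not part of text2vec): compare two document-topic
# matrices of the same shape, documents in rows and topics in columns.
# Assumes every row is a probability distribution (sums to 1).
compare_doc_topic = function(a, b) {
  stopifnot(all(dim(a) == dim(b)))
  list(
    # largest absolute difference in any single topic proportion
    max_abs_diff = max(abs(a - b)),
    # mean per-document total variation distance (0 means identical rows)
    mean_tv_dist = mean(rowSums(abs(a - b)) / 2),
    # share of documents whose most probable topic is the same in both
    top_topic_agreement = mean(max.col(a, ties.method = "first") ==
                               max.col(b, ties.method = "first"))
  )
}

# Toy data standing in for the fit_transform and transform outputs
a = matrix(c(0.7, 0.2, 0.1,
             0.1, 0.8, 0.1), nrow = 2, byrow = TRUE)
b = matrix(c(0.6, 0.3, 0.1,
             0.2, 0.1, 0.7), nrow = 2, byrow = TRUE)

res = compare_doc_topic(a, b)
print(res)
```

With the real objects from the question, the call would be compare_doc_topic(doc_topic_distr, new_doc_topic_dist). A high top_topic_agreement together with a small mean_tv_dist would suggest the two methods broadly agree even when individual rows do not match exactly.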