I'm using R and the "topicmodels" package to build an LDA model from a 4.5k-document corpus. I do the usual pre-processing steps (stopwords, cutting low/high word frequencies, lemmatization) and end up with a 100-topic model that I'm happy with. In fact, it's an almost perfect model for my needs.
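For context, a rough sketch of the kind of pipeline I mean (illustrative only; raw_texts is a placeholder for my character vector of documents, and I leave the lemmatization step out since it's application-specific):
library(tm)
corpus <- VCorpus(VectorSource(raw_texts)) # raw_texts: placeholder character vector
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm_lemma <- DocumentTermMatrix(corpus)
# cut rare and very frequent terms (thresholds are illustrative)
dtm_lemma <- dtm_lemma[, findFreqTerms(dtm_lemma, lowfreq = 5, highfreq = 1000)]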
justlda <- LDA(k=100, x=dtm_lemma, method="Gibbs", control=control_list_gibbs)
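control_list_gibbs holds my Gibbs settings; it looks roughly like this (the exact values here are illustrative, not my real ones):
control_list_gibbs <- list(
  burnin = 1000, # discard early Gibbs iterations
  iter = 2000,   # sampling iterations after burn-in
  seed = 42      # for reproducibility
)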
I then pre-process a new 300-document corpus (unseen by the model) using the exact same pipeline as above, transform it into a document-term matrix, and use the "posterior" function of the same package to predict the topics on the new data. This corpus comes from the same authors and is very similar to the training set.
The predictions (posterior probabilities) I get are totally wrong. This is the code I'm using to get the posterior:
topics = posterior(justlda, dtm_lemma, control = control_list_gibbs)$topics
I feel that not only are the predictions wrong, but the topic weights are also very low. Nothing comes out as a dominant topic. (For this 100-topic model, most topics come out around 0.08, and I'm lucky to get a 0.20 weight that isn't even relevant...)
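This is how I inspect the strongest weight per document, for what it's worth:
# largest posterior weight per document; mostly around 0.08, rarely above 0.20
summary(apply(topics, 1, max))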
I have less than a year of experience with NLP/LDA and the R language, so I feel I could be making a very amateur mistake somewhere that would explain the wrong predictions. Is this kind of result normal? What could I possibly be doing wrong?
I'm not 100% sure what you mean by 'wrong'. I did a quick test to see if posterior works on new data. First I run a model with all documents of the AssociatedPress dataset:
library(topicmodels)
data("AssociatedPress")
ap_lda <- LDA(AssociatedPress, k = 5, control = list(seed = 1234))
From your question, I suspect you are looking at the most likely topic for each document. To keep things comparable, I build my own way of finding these out, based on some tidy packages:
library(tidytext)
library(dplyr)
library(tidyr)
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents %>%
  group_by(document) %>%
  top_n(1, gamma) %>% # keep only the most likely topic
  arrange(document)
# A tibble: 2,246 x 3
# Groups:   document [2,246]
   document topic gamma
      <int> <int> <dbl>
 1        1     4 0.999
 2        2     2 0.529
 3        3     4 0.999
 4        4     4 0.518
 5        5     4 0.995
 6        6     2 0.971
 7        7     1 0.728
 8        8     2 0.941
 9        9     4 0.477
10       10     5 0.500
# ... with 2,236 more rows
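(As a sanity check: topicmodels also has a built-in topics() helper that returns the most likely topic per document straight from a fitted model. It only works on the model object, not on what posterior returns, which is why I use the tidy route in both cases.)
# built-in shortcut, fitted models only
head(topics(ap_lda, 1), 10)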
Now I run the same LDA again but withhold the first 10 documents:
AssociatedPress_train <- AssociatedPress[11:nrow(AssociatedPress), ]
AssociatedPress_test <- AssociatedPress[1:10, ]
ap_lda <- LDA(AssociatedPress_train, k = 5, control = list(seed = 1234))
I use posterior to get the gamma values for each document and again keep only the most likely topic:
posterior(object = ap_lda, newdata = AssociatedPress_test)$topics %>%
  as_tibble() %>%
  mutate(document = seq_len(nrow(.))) %>%
  gather(topic, gamma, -document) %>%
  group_by(document) %>%
  top_n(1, gamma) %>% # keep only the most probable topic
  arrange(document)
# A tibble: 10 x 3
# Groups:   document [10]
   document topic gamma
      <int> <chr> <dbl>
 1        1     4 0.898
 2        2     2 0.497
 3        3     4 0.896
 4        4     4 0.468
 5        5     4 0.870
 6        6     2 0.754
 7        7     1 0.509
 8        8     2 0.913
 9        9     4 0.476
10       10     2 0.399
All but document 10 have the same most likely topic as before, so posterior seems to work fine on new data, and I don't see an immediate problem with your code.
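If you want to check that programmatically rather than by eye, something like this works (a sketch, assuming you saved the two result tibbles above as top_full and top_test; topic is a character column after gather, hence the conversion):
inner_join(top_full,
           top_test %>% mutate(topic = as.integer(topic)),
           by = "document", suffix = c("_full", "_test")) %>%
  mutate(same = topic_full == topic_test) # TRUE where the two fits agree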
One thing I haven't tested is what happens if the DTMs of the training and test set have different columns. I suspect that would be a problem. Here is a quick example of how you could deal with it:
text1 <- tibble(doc = 1, word = LETTERS[1:10])
text2 <- tibble(doc = 1, word = LETTERS[2:11])
dtm1 <- text1 %>%
  count(doc, word) %>%
  arrange(word) %>%
  cast_dtm(doc, word, n)

dtm2 <- text2 %>%
  count(doc, word) %>%
  arrange(word) %>%
  cast_dtm(doc, word, n)
all.equal(dtm1$dimnames$Terms, dtm2$dimnames$Terms)
[1] "10 string mismatches"
I made two DTMs where the second one has an extra term and lacks one term from the first, so the dimnames differ. We can make them equal by bringing the second DTM back into a tidy format, removing the extra term, and adding the missing terms with zero counts before casting the DTM again:
dtm2_clean <- tidy(dtm2) %>%
  filter(term %in% dtm1$dimnames$Terms) %>% # drop terms the model has never seen
  rbind(tibble(document = 1,
               term = dtm1$dimnames$Terms, # add every training term...
               count = 0)) %>%             # ...with a zero count; duplicates are summed on casting
  arrange(term) %>%
  cast_dtm(document, term, count)
all.equal(dtm1$dimnames$Terms, dtm2_clean$dimnames$Terms)
[1] TRUE
You can now use this as newdata for posterior.
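Applied to your case, the last step would look something like this (a sketch with placeholder names: dtm_new_clean stands for your new-data DTM after the same column alignment against your training vocabulary):
# justlda: your fitted model; dtm_new_clean: aligned new-data DTM (placeholder)
topics_new <- posterior(justlda, newdata = dtm_new_clean)$topics
apply(topics_new, 1, which.max) # most likely topic per new document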