Tags: r, nlp, lda, topic-modeling, topicmodels

Why are LDA predictions incorrect?


Step 1

I'm using R and the "topicmodels" package to build an LDA model from a corpus of 4.5k documents. I do the usual pre-processing steps (stopword removal, cutting very low and very high word frequencies, lemmatization) and end up with a 100-topic model that I'm happy with. In fact, it's an almost perfect model for my needs.
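
Roughly, my pre-processing looks like this (a sketch only: raw_texts and the frequency cut-offs are placeholders, and the lemmatization step is omitted here):

library(tm)
# build a corpus from a character vector of documents
corpus <- VCorpus(VectorSource(raw_texts))
# lowercase, strip punctuation and stopwords while building the DTM
dtm_lemma <- DocumentTermMatrix(corpus,
                                control = list(tolower = TRUE,
                                               removePunctuation = TRUE,
                                               stopwords = TRUE))
# cut very rare and very frequent terms
keep <- findFreqTerms(dtm_lemma, lowfreq = 5, highfreq = 1000)
dtm_lemma <- dtm_lemma[, which(Terms(dtm_lemma) %in% keep)]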

justlda <- LDA(k=100, x=dtm_lemma, method="Gibbs", control=control_list_gibbs)
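
For reference, control_list_gibbs is just a list of Gibbs sampler settings, roughly along these lines (the values here are placeholders, not my actual settings):

control_list_gibbs <- list(burnin = 1000,  # discard the first draws
                           iter = 2000,    # number of Gibbs iterations
                           thin = 100,     # keep every 100th draw
                           seed = 1234,    # reproducibility
                           best = TRUE)    # keep the highest-likelihood run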

Step 2

I then pre-process a new corpus of 300 documents (unseen by the model) with the exact same process as above, transform it into a document-term matrix, and use the "posterior" function of the same package to predict the topics on the new data. This corpus comes from the same authors and is very similar to the training set.

My problem

The predictions (posterior probabilities) I get are totally wrong. This is the code I'm using to get the posterior:

topics = posterior(justlda, dtm_lemma, control = control_list_gibbs)$topics
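
To illustrate what I mean by low weights below, this is roughly how I look at the dominant topic per document (a sketch in base R, using the topics matrix returned above):

top_topic  <- apply(topics, 1, which.max)  # index of the most likely topic per document
top_weight <- apply(topics, 1, max)        # weight of that topic
summary(top_weight)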

Not only do the predictions feel wrong, the topic weights are also very low. Nothing comes out as a dominant topic. (For this 100-topic model, most topics come out at 0.08, and I'm lucky to get a 0.20 weight that isn't even relevant...)

I have less than a year of experience with NLP/LDA and the R language, so I feel I could be making a very amateur mistake somewhere that would explain the wrong predictions.

Is this kind of result normal? What could I possibly be doing wrong?


Solution

  • I'm not 100% sure what you mean by 'wrong'. I did a quick test to see if posterior works on new data. First, I fit a model on all documents of the AssociatedPress dataset:

    library(topicmodels)
    data("AssociatedPress")
    ap_lda <- LDA(AssociatedPress, k = 5, control = list(seed = 1234))
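    # optional sanity check: print the 5 most probable terms per topic
    terms(ap_lda, 5)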
    

    From your question, I suspect you are looking at the most likely topic for each document. To keep things comparable, I build my own way of finding these, based on some tidyverse packages:

    library(tidytext)
    library(dplyr)
    library(tidyr)
    ap_documents <- tidy(ap_lda, matrix = "gamma")
    ap_documents %>% 
      group_by(document) %>% 
      top_n(1, gamma) %>% # keep only most likely topic
      arrange(document)
    # A tibble: 2,246 x 3
    # Groups:   document [2,246]
       document topic gamma
          <int> <int> <dbl>
     1        1     4 0.999
     2        2     2 0.529
     3        3     4 0.999
     4        4     4 0.518
     5        5     4 0.995
     6        6     2 0.971
     7        7     1 0.728
     8        8     2 0.941
     9        9     4 0.477
    10       10     5 0.500
    # ... with 2,236 more rows
    

    Now I run the same LDA again but withhold the first 10 documents:

    AssociatedPress_train <- AssociatedPress[11:nrow(AssociatedPress), ]
    AssociatedPress_test <- AssociatedPress[1:10, ]
    
    ap_lda <- LDA(AssociatedPress_train, k = 5, control = list(seed = 1234))
    

    I use posterior to get the gamma values for each document and again just keep the most likely:

    posterior(object = ap_lda, newdata = AssociatedPress_test)$topics %>%
      as_tibble() %>% 
      mutate(document = seq_len(nrow(.))) %>% 
      gather(topic, gamma, -document) %>% 
      group_by(document) %>% 
      top_n(1, gamma) %>% # keep only most probable topic
      arrange(document)
    # A tibble: 10 x 3
    # Groups:   document [10]
       document topic gamma
          <int> <chr> <dbl>
     1        1 4     0.898
     2        2 2     0.497
     3        3 4     0.896
     4        4 4     0.468
     5        5 4     0.870
     6        6 2     0.754
     7        7 1     0.509
     8        8 2     0.913
     9        9 4     0.476
    10       10 2     0.399
    

    All but document 10 have the same most likely topic as before, so everything seems to work fine! I don't see an immediate problem with your code.
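
    If you want to check the overlap more systematically than by eye, you can join the two result tables on the document id. A sketch only: it assumes the first table above was stored as gamma_full and the posterior-based one as gamma_test (names are just for this example), and it ignores possible ties:

    gamma_full %>%
      filter(document <= 10) %>%                 # the documents that were later held out
      inner_join(gamma_test, by = "document",
                 suffix = c("_full", "_test")) %>%
      ungroup() %>%
      summarise(agreement = mean(topic_full == topic_test))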

    One thing I haven't tested is what happens if the DTMs of the training and test set have different columns. I suspect that would be a problem.

    Here is a quick example of how you could deal with that:

    text1 <- tibble(doc = 1, word = LETTERS[1:10])
    text2 <- tibble(doc = 1, word = LETTERS[2:11])
    dtm1 <- text1 %>%
      count(doc, word) %>%
      arrange(word) %>%
      cast_dtm(doc, word, n)
    
    dtm2 <- text2 %>%
      count(doc, word) %>%
      arrange(word) %>%
      cast_dtm(doc, word, n)
    
    all.equal(dtm1$dimnames$Terms, dtm2$dimnames$Terms)
    [1] "10 string mismatches"
    

    I make two DTMs where the second one has an extra term and lacks one term from the first. The dimnames are thus different. We can make them equal by bringing the second DTM back into a tidy format, removing the extra term and adding the missing term (with a count of 0) before casting the DTM again:

    dtm2_clean <- tidy(dtm2) %>% 
      filter(term %in% dtm1$dimnames$Terms) %>% 
      rbind(tibble(document = 1, 
                 term = dtm1$dimnames$Terms, # add every term from dtm1 with count 0
                   count = 0)) %>% 
      arrange(term) %>% 
      cast_dtm(document, term, count)
    
    all.equal(dtm1$dimnames$Terms, dtm2_clean$dimnames$Terms)
    [1] TRUE
    

    You can now use this as newdata for posterior.
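
    For completeness, the aligned DTM then goes into posterior just like before. A minimal sketch, assuming lda_model is some model fitted on a DTM with dtm1's vocabulary (the name is a placeholder):

    # dtm2_clean now has exactly the same Terms as the training DTM
    posterior(object = lda_model, newdata = dtm2_clean)$topics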