r, topicmodels

How to create a grid search to find best parameters?


I am fitting an LDA model with the topicmodels package using the following setup:

    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")

    # Parameters for Gibbs sampling
    burnin <- 4000
    iter <- 2000
    thin <- 500
    seed <- list(1969, 5, 25, 102855, 2012)
    nstart <- 5
    best <- TRUE

    # Number of topics
    k <- 10

    # Run LDA with Gibbs sampling
    ldaOut <- LDA(AssociatedPress[1:20, ], k, method = "Gibbs",
                  control = list(nstart = nstart, seed = seed, best = best,
                                 burnin = burnin, iter = iter, thin = thin))

How can I set up a grid search to find the best values for these parameters?


Solution

  • The ldatuning package can help you find a suitable number of topics. See the code below. Be careful not to run it on the full AssociatedPress dataset, as that might take a few hours.

    Several metrics are used for tuning. You can read up on them in the references of the ldatuning vignette.

    library(ldatuning)
    library(topicmodels)
    data("AssociatedPress", package="topicmodels")
    
    my_dtm <- AssociatedPress[1:20,]
    
    result <- FindTopicsNumber(
      my_dtm,
      topics = seq(from = 2, to = 10, by = 1),
      metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
      method = "Gibbs",
      control = list(seed = 4242),
      mc.cores = 3L,
      verbose = TRUE
    )
    
    fit models... done.
    calculate metrics:
      Griffiths2004... done.
      CaoJuan2009... done.
      Arun2010... done.
      Deveaud2014... done.
    
    result
    
      topics Griffiths2004 CaoJuan2009 Arun2010 Deveaud2014
    1     10     -29769.77   0.2049923 32.15563   0.3651554
    2      9     -29679.41   0.1913860 32.07003   0.4018582
    3      8     -29682.97   0.1619718 32.45093   0.4407269
    4      7     -29617.64   0.1556135 33.58472   0.4908904
    5      6     -29632.34   0.1247883 33.04505   0.5502962
    6      5     -29634.21   0.1201017 34.07776   0.6244967
    7      4     -29685.18   0.1134287 35.96230   0.7129967
    8      3     -29864.36   0.1070237 38.18795   0.8194811
    9      2     -30216.09   0.1040786 42.01731   0.9678864
    
    
    FindTopicsNumber_plot(result)
    

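    The best value per metric can also be read from the result table directly instead of from the plot. The following is a minimal sketch in base R. It hard-codes the table printed above so that it runs without refitting, and it assumes the directions described in the ldatuning vignette (maximize Griffiths2004 and Deveaud2014, minimize CaoJuan2009 and Arun2010). The names `maximize`, `minimize` and `best_k` are made up for this example.

    ```r
    # Result table copied from the FindTopicsNumber() output above
    result <- data.frame(
      topics        = 10:2,
      Griffiths2004 = c(-29769.77, -29679.41, -29682.97, -29617.64, -29632.34,
                        -29634.21, -29685.18, -29864.36, -30216.09),
      CaoJuan2009   = c(0.2049923, 0.1913860, 0.1619718, 0.1556135, 0.1247883,
                        0.1201017, 0.1134287, 0.1070237, 0.1040786),
      Arun2010      = c(32.15563, 32.07003, 32.45093, 33.58472, 33.04505,
                        34.07776, 35.96230, 38.18795, 42.01731),
      Deveaud2014   = c(0.3651554, 0.4018582, 0.4407269, 0.4908904, 0.5502962,
                        0.6244967, 0.7129967, 0.8194811, 0.9678864)
    )

    # Assumed optimisation direction per metric (see the ldatuning vignette)
    maximize <- c("Griffiths2004", "Deveaud2014")
    minimize <- c("CaoJuan2009", "Arun2010")

    # Number of topics at the best value of each metric
    best_k <- c(
      sapply(result[maximize], function(x) result$topics[which.max(x)]),
      sapply(result[minimize], function(x) result$topics[which.min(x)])
    )
    best_k
    ```

    The metrics rarely agree, so treat the values in `best_k` as candidates to inspect rather than as a final answer.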

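    The metrics disagree. In the table above, Griffiths2004 (higher is better) peaks at 7 topics, with 5 and 6 close behind. Deveaud2014 (higher is better) favors 2 topics, and Arun2010 (lower is better) favors 9. So let's fit models for a set of candidate values: 2, 5 and 9. I added 3 as well, but read up on each metric before relying on it.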

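    no_topics <- c(2, 3, 5, 9)

    # control_list_gibbs was not defined in the original snippet. The values
    # below are an assumption that reuses the Gibbs settings from the question.
    control_list_gibbs <- list(burnin = 4000, iter = 2000, thin = 500,
                               seed = list(1969, 5, 25, 102855, 2012),
                               nstart = 5, best = TRUE)

    # Fit one Gibbs LDA model per candidate number of topics
    lda_list <- lapply(no_topics, function(k) LDA(x = my_dtm,
                                                  k = k,
                                                  method = "Gibbs",
                                                  control = control_list_gibbs
                                               )
                       )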
    names(lda_list) <- paste0("no_", no_topics)
    
    lda_list
    
    $no_2
    A LDA_Gibbs topic model with 2 topics.
    
    $no_3
    A LDA_Gibbs topic model with 3 topics.
    
    $no_5
    A LDA_Gibbs topic model with 5 topics.
    
    $no_9
    A LDA_Gibbs topic model with 9 topics.
    

    After this, it is a case of inspecting the LDA outcomes to see whether any of them are any good.
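    As a starting point for that inspection, here is a minimal sketch using accessor functions from topicmodels: `terms()`, `topics()`, `posterior()` and `logLik()`. To stay self-contained it fits its own small model with deliberately short burn-in and iteration settings (an assumption made only to keep the run time low). Any fitted model, for example an element of `lda_list`, can be used in place of `fit`.

    ```r
    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")

    my_dtm <- AssociatedPress[1:20, ]

    # Small illustrative model; burnin and iter are kept short on purpose
    fit <- LDA(my_dtm, k = 3, method = "Gibbs",
               control = list(seed = 4242, burnin = 100, iter = 200))

    # Five most probable terms per topic (one column per topic)
    top_terms <- terms(fit, 5)

    # Most likely topic for each document
    doc_topics <- topics(fit)

    # Full document-topic probability matrix (one row per document)
    doc_probs <- posterior(fit)$topics

    top_terms
    table(doc_topics)
    logLik(fit)
    ```

    Topics whose top terms are hard to tell apart, or topics that almost no document is assigned to, are a hint that the chosen number of topics is too high for this data.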

    For an in-depth overview of this subject, you can read this blog post. The author uses purrr, tidytext, dplyr and ggplot2 to investigate a dataset.

    And here is a blog post about using cross-validation with ldatuning and topicmodels.
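    For a grid search over more than the number of topics, one possible approach is to enumerate the parameter combinations with `expand.grid()`, fit a model for each on a training subset, and score it by perplexity on held-out documents. The sketch below is an illustration under assumptions, not a recommended setup: the grid values, the 15/5 split and the short `burnin` and `iter` settings are arbitrary choices made to keep it fast, and `train_dtm`, `test_dtm`, `grid` and `best_params` are names invented here.

    ```r
    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")

    # Hypothetical train/test split of the 20 documents used above
    train_dtm <- AssociatedPress[1:15, ]
    test_dtm  <- AssociatedPress[16:20, ]

    # Every combination of the candidate values (arbitrary example values)
    grid <- expand.grid(k = c(2, 5), alpha = c(0.1, 1))

    # Fit one model per row of the grid and score it on the held-out documents
    grid$perplexity <- vapply(seq_len(nrow(grid)), function(i) {
      fit <- LDA(train_dtm, k = grid$k[i], method = "Gibbs",
                 control = list(alpha = grid$alpha[i], seed = 4242,
                                burnin = 100, iter = 200))
      perplexity(fit, newdata = test_dtm)
    }, numeric(1))

    # Lower perplexity is better
    best_params <- grid[which.min(grid$perplexity), ]
    grid
    best_params
    ```

    With only 20 documents the scores are very noisy. On a real corpus you would use a larger grid, longer chains, and ideally several folds instead of a single split.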