Tags: r, plot, text2vec, perplexity

Elbow/knee in a curve in R


I have this data-processing code:

library(text2vec)

##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){

  set.seed(17)
  lda_model2 <- LDA$new(n_topics = i)
  doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = FALSE)

  set.seed(17)
  sample.dtm2 <- itoken(rawsample$Abstract, 
                       preprocessor = prep_fun, 
                       tokenizer = tok_fun, 
                       ids = rawsample$id,
                       progressbar = FALSE) %>%
    create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)

  set.seed(17)
  new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000, 
                                               convergence_tol = 0.001, n_check_convergence = 25, 
                                               progressbar = FALSE)

  perplex[i]  <- text2vec::perplexity(sample.dtm2, topic_word_distribution = 
                                        lda_model2$topic_word_distribution, 
                                      doc_topic_distribution = new_doc_topic_distr2) 

}
print(difftime(Sys.time(), t1, units = 'sec'))

I know there are many questions like this, but I haven't found one that answers my exact situation. The code above computes perplexity for Latent Dirichlet Allocation models with 3 to 25 topics. I want to pick the best value among them, meaning that I want to find the elbow (or knee) of the curve. The values can be treated as a plain numeric vector, which looks like this:

1   NA
2   NA
3   222.6229
4   210.3442
5   200.1335
6   190.3143
7   180.4195
8   174.2634
9   166.2670
10  159.7535
11  153.7785
12  148.1623
13  144.1554
14  141.8250
15  138.8301
16  134.4956
17  131.0745
18  128.8941
19  125.8468
20  123.8477
21  120.5155
22  118.4426
23  116.4619
24  113.2401
25  114.1233
plot(perplex)

This is what the plot looks like:

I would say the elbow is at 13 or 16, but I'm not sure, and I want an exact number as the result. I saw in this paper that the knee formula is f''(x) / (1 + f'(x)^2)^1.5. I tried it like this, and it says 18:

> d1 <- diff(perplex)                # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
  longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
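
The warning in the transcript above arises because d1 has 24 elements while d2 has 23, so R recycles the shorter vector and the two derivatives no longer refer to the same points. A minimal sketch of one way to keep them aligned, assuming the topic counts are one unit apart and using central differences at the interior points only. The names k, y, interior, kappa and knee_k are illustrative and not part of text2vec.

```r
# Perplexity values for 3..25 topics, copied from the output above
k <- 3:25
y <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
       166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
       138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
       120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

n <- length(y)
interior <- 2:(n - 1)  # points that have a neighbour on both sides

# Central differences, so both derivatives are defined at the same points
d1 <- (y[interior + 1] - y[interior - 1]) / 2              # approximates f'(x)
d2 <- y[interior + 1] - 2 * y[interior] + y[interior - 1]  # approximates f''(x)

# Curvature f''(x) / (1 + f'(x)^2)^1.5 at each interior topic count
kappa <- d2 / (1 + d1^2)^1.5

# Topic count with the largest absolute curvature
knee_k <- k[interior][which.max(abs(kappa))]
print(knee_k)
```

On these values the sketch picks 24, which is driven by the small rise in perplexity at 25 rather than by a visible bend. Curvature depends on the scale of both axes and is sensitive to noise, so the result is better treated as a hint than as an exact answer.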

I can't quite figure this out. Could someone show how to get the ideal number of topics from the perplexity values as a single result?


Solution

  • I found this in this paper: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)". So this code does the job:

    d1 <- diff(perplex)
    k <- which.max(abs(diff(d1) / diff(perplex[-1])))
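
The quoted rule (maximum absolute second derivative) can also be sketched with plain second differences, mapping the result back to a topic count. This is a sketch under assumptions, not the paper's implementation: it assumes topic counts one unit apart, so no division by the spacing is needed, and the names k, y, centres and elbow_k are illustrative. Note that which.max returns a position in the differenced vector, and that position has to be shifted to the point the second difference is centred on.

```r
# Perplexity values for 3..25 topics, copied from the question
k <- 3:25
y <- c(222.6229, 210.3442, 200.1335, 190.3143, 180.4195, 174.2634,
       166.2670, 159.7535, 153.7785, 148.1623, 144.1554, 141.8250,
       138.8301, 134.4956, 131.0745, 128.8941, 125.8468, 123.8477,
       120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

# Second differences: d2[i] = y[i + 2] - 2 * y[i + 1] + y[i]
d2 <- diff(y, differences = 2)

# d2[i] is centred on y[i + 1], so drop the first and last topic counts
centres <- k[-c(1, length(k))]

# Topic count where the absolute second difference is largest
elbow_k <- centres[which.max(abs(d2))]
print(elbow_k)
```

With the values from the question this gives 24, because the largest change in slope is the rise in perplexity at 25. Whether that counts as the elbow is a judgment call; smoothing the curve or excluding the final point before differencing may give a different answer.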