I was wondering how the results of different packages, and hence different algorithms, differ, and whether parameters can be set in a way that produces similar topics. I had a look at the packages text2vec and topicmodels in particular.
I used the code below to compare 10 topics (see the code section for the terms) generated with these packages. I could not manage to generate sets of topics with similar meaning. For example, topic 10 from text2vec has something to do with "police", while none of the topics produced by topicmodels refers to "police" or similar terms. Furthermore, I could not identify a counterpart to topic 5 produced by topicmodels, which has something to do with "life-love-family-war", among the topics produced by text2vec.
I am a beginner with LDA, so my understanding may sound naive to experienced programmers. However, intuitively one would assume that it should be possible to produce sets of topics with similar meaning in order to demonstrate the validity/robustness of the results. Of course, not necessarily the exact same sets of terms, but term lists addressing similar topics.
Maybe the issue is simply that my human interpretation of these term lists is not good enough to capture the similarities, but maybe there are parameter settings that would make the topics more similar for human interpretation. Can someone guide me on how to set the parameters to achieve this, or otherwise provide explanations or pointers to suitable resources to improve my understanding of the matter?
Here are some issues that might be relevant:

- text2vec does not use standard Gibbs sampling but WarpLDA, which is already a difference in the algorithm compared to topicmodels.
- If my understanding is correct, the priors alpha and delta used in topicmodels are set as doc_topic_prior and topic_word_prior in text2vec, respectively.
- text2vec uses lambda for sorting the terms of a topic based on their frequency. I have not yet understood how terms are sorted in topicmodels - is this comparable to setting lambda=1? (I have tried different lambdas between 0 and 1 without getting similar topics; see also the relevance sketch in the update below.)
- Results depend on the seed (see, e.g., this question). This is not directly my question, but it might make it more difficult to respond.

Sorry for the lengthy question and thanks in advance for any help or suggestions.
Update 2: I have moved the content of my first update into an answer that is based on a more complete analysis.
Update: Following the helpful comment of text2vec package creator Dmitriy Selivanov, I can confirm that setting lambda=1 increases the similarity of the topics between the term lists produced by the two packages.
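If my understanding is correct, lambda in get_top_words() follows the "relevance" ranking from the LDAvis paper, and lambda=1 corresponds to ranking purely by p(term | topic), which is how terms() in topicmodels orders terms. A minimal sketch of that ranking (phi and term_freq are illustrative inputs here, not taken from a fitted model):
#phi: topic-word probability matrix (rows = topics, columns = terms)
#term_freq: overall corpus frequency of each term
relevance <- function(phi, term_freq, lambda = 1) {
  p_w <- term_freq / sum(term_freq) #marginal term probabilities
  lambda * log(phi) + (1 - lambda) * log(sweep(phi, 2, p_w, "/"))
}
#smaller lambda up-weights terms that are specific to a topic rather than frequent overall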
Furthermore, I had a closer look at the differences between the term lists produced by the two packages via a quick check of length(setdiff()) and length(intersect()) across topics (see the code below). This rough check shows that text2vec discards several terms per topic - probably via some probability threshold per topic? topicmodels keeps all terms for all topics. This explains part of the differences in the meanings that can be derived (by a human) from the term lists.
As mentioned above, generating a reproducible example seems difficult, so I have not adapted all data examples in the code below. Since the run time is short, anybody can check the results on his/her own system.
library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
ntopics <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter <- 1000
convtol <- 0.001
set.seed(0) #for text2vec
seedpar <- 0 #for topicmodels
#Generate document term matrix with text2vec
tokens = movie_review$review[1:1000] %>%
tolower %>%
word_tokenizer
it = itoken(tokens, ids = movie_review$id[1:1000], progressbar = FALSE)
vocab = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
#LDA model with text2vec
lda_model = text2vec::LDA$new(n_topics = ntopics
,doc_topic_prior = alphaprior
,topic_word_prior = deltaprior
)
doc_topic_distr = lda_model$fit_transform(x = dtm
,n_iter = niter
,convergence_tol = convtol
,n_check_convergence = 25
,progressbar = FALSE
)
#LDA model with topicmodels
#note: the control list has to be passed via the 'control' argument;
#the convergence tolerance is not part of the Gibbs controls, so convtol is not used here
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm), k = ntopics, method = "Gibbs",
                      control = list(burnin = 100
                                     ,delta = deltaprior
                                     ,alpha = alphaprior
                                     ,iter = niter
                                     ,keep = 50
                                     ,seed = seedpar
                                     ,initialize = "seeded"
                                     )
                      )
#show top 15 words
lda_model$get_top_words(n = 15, topic_number = c(1:10), lambda = 0.3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "finally" "men" "know" "video" "10" "king" "five" "our" "child" "cop"
# [2,] "re" "always" "ve" "1" "doesn" "match" "atmosphere" "husband" "later" "themselves"
# [3,] "three" "lost" "got" "head" "zombie" "lee" "mr" "comedy" "parents" "mary"
# [4,] "m" "team" "say" "girls" "message" "song" "de" "seem" "sexual" "average"
# [5,] "gay" "here" "d" "camera" "start" "musical" "may" "man" "murder" "scenes"
# [6,] "kids" "within" "funny" "kill" "3" "four" "especially" "problem" "tale" "police"
# [7,] "sort" "score" "want" "stupid" "zombies" "dance" "quality" "friends" "television" "appears"
# [8,] "few" "thriller" "movies" "talking" "movies" "action" "public" "given" "okay" "trying"
# [9,] "bit" "surprise" "let" "hard" "ask" "fun" "events" "crime" "cover" "waiting"
# [10,] "hot" "own" "thinking" "horrible" "won" "tony" "u" "special" "stan" "lewis"
# [11,] "die" "political" "nice" "stay" "open" "twist" "kelly" "through" "uses" "imdb"
# [12,] "credits" "success" "never" "back" "davis" "killer" "novel" "world" "order" "candy"
# [13,] "two" "does" "bunch" "didn" "completely" "ending" "copy" "show" "strange" "name"
# [14,] "otherwise" "beauty" "hilarious" "room" "love" "dancing" "japanese" "new" "female" "low"
# [15,] "need" "brilliant" "lot" "minutes" "away" "convincing" "far" "mostly" "girl" "killing"
terms(ldatopicmodels, 15)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] "show" "where" "horror" "did" "life" "such" "m" "films" "man" "seen"
# [2,] "years" "minutes" "pretty" "10" "young" "character" "something" "music" "new" "movies"
# [3,] "old" "gets" "best" "now" "through" "while" "re" "actors" "two" "plot"
# [4,] "every" "guy" "ending" "why" "love" "those" "going" "role" "though" "better"
# [5,] "series" "another" "bit" "saw" "woman" "does" "things" "performance" "big" "worst"
# [6,] "funny" "around" "quite" "didn" "us" "seems" "want" "between" "back" "interesting"
# [7,] "comedy" "nothing" "little" "say" "real" "book" "thing" "love" "action" "your"
# [8,] "again" "down" "actually" "thought" "our" "may" "know" "play" "shot" "money"
# [9,] "tv" "take" "house" "still" "war" "work" "ve" "line" "together" "hard"
# [10,] "watching" "these" "however" "end" "father" "far" "here" "actor" "against" "poor"
# [11,] "cast" "fun" "cast" "got" "find" "scenes" "doesn" "star" "title" "least"
# [12,] "long" "night" "entertaining" "2" "human" "both" "look" "never" "go" "say"
# [13,] "through" "scene" "must" "am" "shows" "yet" "isn" "played" "city" "director"
# [14,] "once" "back" "each" "done" "family" "audience" "anything" "hollywood" "came" "probably"
# [15,] "watched" "dead" "makes" "3" "mother" "almost" "enough" "always" "match" "video"
#UPDATE
#number of terms in each model is the same
length(ldatopicmodels@terms)
# [1] 2170
nrow(vocab)
# [1] 2170
#number of NA entries for termlist of first topic differs
sum(is.na(
lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[,1]
)
)
#[1] 1778
sum(is.na(
terms(ldatopicmodels, length(ldatopicmodels@terms))
)
)
#[1] 0
#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(setdiff(i[!is.na(i)],j[!is.na(j)]))
})
})
}
#apply the check
termstopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))
termstext2vec <- lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)
lengthsetdiff(termstopicmodels,
termstopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
lengthsetdiff(termstext2vec,
termstext2vec)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 340 318 335 292 309 320 355 294 322
# [2,] 355 0 321 343 292 319 311 346 302 339
# [3,] 350 338 0 316 286 309 311 358 318 322
# [4,] 346 339 295 0 297 310 301 335 309 332
# [5,] 345 330 307 339 0 310 310 354 309 333
# [6,] 350 345 318 340 298 0 311 342 308 325
# [7,] 366 342 325 336 303 316 0 364 311 325
# [8,] 355 331 326 324 301 301 318 0 311 335
# [9,] 336 329 328 340 298 309 307 353 0 314
# [10,] 342 344 310 341 300 304 299 355 292 0
lengthsetdiff(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [2,] 1793 1793 1793 1793 1793 1793 1793 1793 1793 1793
# [3,] 1810 1810 1810 1810 1810 1810 1810 1810 1810 1810
# [4,] 1789 1789 1789 1789 1789 1789 1789 1789 1789 1789
# [5,] 1831 1831 1831 1831 1831 1831 1831 1831 1831 1831
# [6,] 1819 1819 1819 1819 1819 1819 1819 1819 1819 1819
# [7,] 1824 1824 1824 1824 1824 1824 1824 1824 1824 1824
# [8,] 1778 1778 1778 1778 1778 1778 1778 1778 1778 1778
# [9,] 1820 1820 1820 1820 1820 1820 1820 1820 1820 1820
# [10,] 1798 1798 1798 1798 1798 1798 1798 1798 1798 1798
lengthsetdiff(termstext2vec,
termstopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(intersect(i[!is.na(i)], j[!is.na(j)]))
})
})
}
lengthintersect(termstopicmodels,
termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,] 392 392 392 392 392 392 392 392 392 392
# [2,] 377 377 377 377 377 377 377 377 377 377
# [3,] 360 360 360 360 360 360 360 360 360 360
# [4,] 381 381 381 381 381 381 381 381 381 381
# [5,] 339 339 339 339 339 339 339 339 339 339
# [6,] 351 351 351 351 351 351 351 351 351 351
# [7,] 346 346 346 346 346 346 346 346 346 346
# [8,] 392 392 392 392 392 392 392 392 392 392
# [9,] 350 350 350 350 350 350 350 350 350 350
# [10,] 372 372 372 372 372 372 372 372 372 372
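Since every topic from topicmodels contains the full vocabulary, the raw intersection counts above mainly reflect how many non-NA terms text2vec keeps per topic. A normalized overlap of only the top N terms is easier to scan for matching topics (jaccard_topN is just an illustrative helper of mine, not part of either package):
jaccard_topN <- function(x, y, N = 20) {
  apply(x[1:N, ], 2, function(i) {
    apply(y[1:N, ], 2, function(j) {
      i <- i[!is.na(i)]
      j <- j[!is.na(j)]
      length(intersect(i, j)) / length(union(i, j))
    })
  })
}
round(jaccard_topN(termstopicmodels, termstext2vec), 2)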
After updating my question with some comparison results, I was still interested in more detail. Therefore, I ran LDA models on the complete movie_review data set included in text2vec (5000 documents). To produce halfway realistic results, I also introduced some gentle pre-processing and stopword removal. (Sorry for the long code example below.)
My conclusion is that some of the "good" topics (from a subjective standpoint) produced by the two packages are comparable to a certain extent (especially the last three topics in the example below are not really good and were difficult to compare). However, looking at similar topics between the two packages produced different (subjective) associations for each topic. Hence, standard Gibbs sampling and the WarpLDA algorithm seem to capture similar topical areas for the given data, but with different "moods" expressed in the topics.
I would see the main reason for the differences in the fact that the WarpLDA algorithm seems to discard terms and introduce NA values in the beta matrix (the term-topic distribution); see the example below. Hence, its faster convergence seems to be achieved by sacrificing completeness.
I do not want to judge which topics are subjectively "better" and leave this to your own judgement.
One important limitation of this analysis is that I have not (yet) checked the results for an optimal number of topics; I only used k=10. Hence, the comparability of the topics might increase for an optimal k, and in any case the quality should improve and with it maybe also the "mood". (The optimal k might again differ between the algorithms, depending on the measure used to find k.)
library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
library(LDAvis)
library(tm) #for stopwords only
ntopics <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter <- 1000
convtol <- 0.001
set.seed(0) #for text2vec
seedpar <- 0 #for topicmodels
docs <- movie_review$review
preproc_fun <- function(x) {
  tolower(x) %>%
    { gsub("[\\W]+", " ", ., perl=T) } %>%   #replace non-word characters by spaces
    { gsub("[\\d]+", " ", ., perl=T) } %>%   #replace digits by spaces
    { gsub(paste0("(?<=\\b)(\\w{1,", 2, "})(?=\\b)"), "", ., perl=T) } %>%   #remove words with 1-2 characters
    { gsub("\\s+", " ", . , perl=T) } %>%    #collapse multiple spaces
    { gsub("^\\s|\\s$", "", ., perl=T) } %>% #trim leading/trailing spaces
    return()
}
#Generate document term matrix with text2vec
tokens = docs %>%
preproc_fun %>%
word_tokenizer
it = itoken(tokens, ids = movie_review$id, progressbar = FALSE)
vocab = create_vocabulary(it, stopwords = tm::stopwords()) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
dim(dtm)
# [1] 5000 7407
#LDA model with text2vec
ldatext2vec = text2vec::LDA$new(n_topics = ntopics
,doc_topic_prior = alphaprior
,topic_word_prior = deltaprior
)
doc_topic_distr = ldatext2vec$fit_transform(x = dtm
,n_iter = niter
,convergence_tol = convtol
,n_check_convergence = 25
,progressbar = FALSE
)
control_Gibbs_topicmodels <- list(
alpha = alphaprior
,delta = deltaprior
,iter = niter
,burnin = 100
,keep = 50
,nstart = 1
,best = TRUE
,seed = seedpar
)
#LDA model with topicmodels
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm)
,k = ntopics
,method = "Gibbs"
,control = control_Gibbs_topicmodels
)
#I have ordered the topics manually after printing top 15 terms and put similar (at least from my subjective standpoint) topics at the beginning
topicsterms_ldatopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))[,c(6,8,10,3,5,9,7,4,1,2)]
topicsterms_ldatext2vec <- ldatext2vec$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[, c(9,6,4,10,5,3,7,2,8,1)]
#show top 15 words
topicsterms_ldatext2vec[1:15,]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "show" "performance" "films" "war" "horror" "say" "man" "love" "know" "man"
# [2,] "series" "role" "director" "american" "killer" "better" "back" "life" "say" "woman"
# [3,] "funny" "films" "scenes" "book" "doesn" "nothing" "last" "big" "life" "life"
# [4,] "still" "music" "audience" "may" "little" "watching" "match" "real" "didn" "police"
# [5,] "original" "love" "though" "world" "isn" "know" "big" "women" "going" "father"
# [6,] "years" "cast" "may" "young" "guy" "worst" "men" "job" "now" "world"
# [7,] "version" "john" "quite" "family" "actually" "didn" "takes" "black" "something" "black"
# [8,] "episode" "play" "real" "mother" "gets" "something" "woman" "new" "things" "wife"
# [9,] "now" "man" "seems" "true" "dead" "actors" "take" "money" "back" "goes"
# [10,] "dvd" "played" "work" "years" "look" "minutes" "young" "work" "saw" "new"
# [11,] "saw" "actor" "scene" "novel" "house" "films" "life" "game" "family" "without"
# [12,] "old" "excellent" "actors" "however" "looks" "least" "city" "world" "love" "around"
# [13,] "watching" "young" "interesting" "small" "poor" "script" "town" "still" "thought" "scene"
# [14,] "watched" "perfect" "rather" "quite" "pretty" "budget" "dance" "comedy" "got" "shot"
# [15,] "better" "high" "yet" "history" "stupid" "lot" "rock" "american" "thing" "another"
topicsterms_ldatopicmodels[1:15,]
# Topic 6 Topic 8 Topic 10 Topic 3 Topic 5 Topic 9 Topic 7 Topic 4 Topic 1 Topic 2
# [1,] "show" "performance" "films" "war" "horror" "funny" "man" "love" "life" "little"
# [2,] "years" "role" "director" "american" "house" "better" "wife" "book" "love" "music"
# [3,] "series" "cast" "something" "documentary" "scene" "say" "gets" "films" "world" "action"
# [4,] "now" "actor" "enough" "part" "killer" "know" "father" "version" "young" "fun"
# [5,] "episode" "play" "doesn" "world" "sex" "watching" "back" "still" "family" "big"
# [6,] "old" "performances" "nothing" "history" "scenes" "thing" "goes" "original" "real" "rock"
# [7,] "back" "comedy" "actually" "america" "gore" "pretty" "new" "quite" "may" "king"
# [8,] "love" "played" "things" "new" "blood" "guy" "woman" "music" "man" "animation"
# [9,] "saw" "director" "seems" "hollywood" "around" "didn" "later" "years" "work" "films"
# [10,] "shows" "job" "know" "japanese" "little" "got" "home" "scenes" "little" "black"
# [11,] "new" "john" "without" "white" "woman" "worst" "money" "old" "lives" "song"
# [12,] "family" "actors" "real" "shot" "night" "thought" "son" "scene" "mother" "pretty"
# [13,] "dvd" "star" "far" "despite" "dead" "wasn" "police" "better" "men" "quite"
# [14,] "still" "excellent" "might" "still" "zombie" "minutes" "husband" "bit" "find" "musical"
# [15,] "know" "work" "fact" "early" "scary" "stupid" "town" "times" "women" "effects"
#number of total terms for each model is the same
#however, the ldatext2vec from text2vec has NA values
length(ldatopicmodels@terms)
# [1] 7407
length(ldatopicmodels@terms[ !is.na(ldatopicmodels@terms)])
# [1] 7407
terms_ldatext2vec <- unique(as.character(topicsterms_ldatext2vec))
length(terms_ldatext2vec)
# [1] 7408
length(terms_ldatext2vec[!is.na(terms_ldatext2vec)])
# [1] 7407
#number of NA entries in topic/termlists of text2vec ldatext2vec
dim(topicsterms_ldatext2vec)
#[1] 7407 10
sum(is.na(topicsterms_ldatext2vec))
# [1] 60368
#share of NA values
sum(is.na(topicsterms_ldatext2vec))/(dim(topicsterms_ldatext2vec)[1]*dim(topicsterms_ldatext2vec)[2])
#[1] 0.8150128
#no NA values in ldatopicmodels
sum(is.na(terms(ldatopicmodels, length(ldatopicmodels@terms))))
#[1] 0
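#sketch: check the term-topic distribution directly (I assume here that the fitted
#text2vec model exposes it as 'topic_word_distribution'; the field name may differ
#across versions); this shows whether discarded terms appear as NA or as zero probabilities
phi_text2vec <- ldatext2vec$topic_word_distribution #n_topics x n_terms
sum(is.na(phi_text2vec))
rowSums(phi_text2vec == 0, na.rm = TRUE) #zero-probability terms per topic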
#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(setdiff(i[!is.na(i)],j[!is.na(j)]))
})
})
}
#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {
apply(x, 2, function(i) {
apply(y, 2, function(j) {
length(intersect(i[!is.na(i)], j[!is.na(j)]))
})
})
}
#since especially the top words are of interest, we first check the intersection of top 20 words
#please note that the order of the topics, especially the last 3 is subjective
lengthintersect(topicsterms_ldatopicmodels[1:20,],
topicsterms_ldatext2vec[1:20,])
# Topic 6 Topic 8 Topic 10 Topic 3 Topic 5 Topic 9 Topic 7 Topic 4 Topic 1 Topic 2
# [1,] 13 1 0 2 0 3 1 7 1 2
# [2,] 1 9 1 0 0 0 2 4 5 4
# [3,] 0 4 8 0 2 0 0 4 3 2
# [4,] 3 0 0 5 2 0 1 5 6 2
# [5,] 1 0 3 0 7 7 1 1 1 2
# [6,] 2 3 6 0 0 10 0 3 0 1
# [7,] 4 2 1 2 1 0 8 1 4 3
# [8,] 3 4 2 5 1 1 2 3 8 5
# [9,] 10 0 4 0 0 8 1 3 3 1
# [10,] 1 0 1 3 3 0 7 1 5 2
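#sketch (my own summary, not a package function): for each topicmodels topic
#(columns of the matrix above), pick the text2vec topic (rows) with the largest
#top-20 overlap, to match topics automatically instead of ordering them manually
apply(lengthintersect(topicsterms_ldatopicmodels[1:20, ],
                      topicsterms_ldatext2vec[1:20, ]), 2, which.max)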
#apply the check with the topics ordered as shown above (ordering based on the top 15 words)
#all words appear in each topic of topicmodels, so the differences are zero
lengthsetdiff(topicsterms_ldatopicmodels,
topicsterms_ldatopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
#not all words appear in each topic
lengthsetdiff(topicsterms_ldatext2vec ,
topicsterms_ldatext2vec )
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 1188 1216 1241 1086 1055 1196 1131 1126 1272
# [2,] 1029 0 1203 1223 1139 1073 1188 1140 1188 1260
# [3,] 1032 1178 0 1224 1084 1024 1186 1122 1164 1238
# [4,] 1075 1216 1242 0 1175 1139 1202 1152 1207 1271
# [5,] 1011 1223 1193 1266 0 1082 1170 1170 1160 1214
# [6,] 993 1170 1146 1243 1095 0 1178 1119 1092 1206
# [7,] 1078 1229 1252 1250 1127 1122 0 1200 1195 1227
# [8,] 1030 1198 1205 1217 1144 1080 1217 0 1171 1211
# [9,] 966 1187 1188 1213 1075 994 1153 1112 0 1198
# [10,] 1095 1242 1245 1260 1112 1091 1168 1135 1181 0
#difference of terms in topics per topic between the two models
lengthsetdiff(topicsterms_ldatopicmodels,
topicsterms_ldatext2vec)
# Topic 6 Topic 8 Topic 10 Topic 3 Topic 5 Topic 9 Topic 7 Topic 4 Topic 1 Topic 2
# [1,] 6157 6157 6157 6157 6157 6157 6157 6157 6157 6157
# [2,] 5998 5998 5998 5998 5998 5998 5998 5998 5998 5998
# [3,] 5973 5973 5973 5973 5973 5973 5973 5973 5973 5973
# [4,] 5991 5991 5991 5991 5991 5991 5991 5991 5991 5991
# [5,] 6082 6082 6082 6082 6082 6082 6082 6082 6082 6082
# [6,] 6095 6095 6095 6095 6095 6095 6095 6095 6095 6095
# [7,] 6039 6039 6039 6039 6039 6039 6039 6039 6039 6039
# [8,] 6056 6056 6056 6056 6056 6056 6056 6056 6056 6056
# [9,] 5997 5997 5997 5997 5997 5997 5997 5997 5997 5997
# [10,] 5980 5980 5980 5980 5980 5980 5980 5980 5980 5980
lengthsetdiff(topicsterms_ldatext2vec,
topicsterms_ldatopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 6 0 0 0 0 0 0 0 0 0 0
# Topic 8 0 0 0 0 0 0 0 0 0 0
# Topic 10 0 0 0 0 0 0 0 0 0 0
# Topic 3 0 0 0 0 0 0 0 0 0 0
# Topic 5 0 0 0 0 0 0 0 0 0 0
# Topic 9 0 0 0 0 0 0 0 0 0 0
# Topic 7 0 0 0 0 0 0 0 0 0 0
# Topic 4 0 0 0 0 0 0 0 0 0 0
# Topic 1 0 0 0 0 0 0 0 0 0 0
# Topic 2 0 0 0 0 0 0 0 0 0 0