I was using latent semantic analysis in the text2vec
package to generate word vectors and using transform to fit new data when I noticed something odd, the spaces not being lined up when trained on the same data.
There appears to be some inconsistency (or randomness?) in the method. Namely, even when re-running an LSA model on the exact same data, the resulting word vectors are wildly different, despite indentical input. When looking around I only found these old closed github issues link link and a mention in the changelog about LSA being cleaned up. I reproduced the behaviour using the movie_review dataset and (slightly modified) code from the documentation:
library(text2vec)
packageVersion("text2vec") # ‘0.5.1’
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# edit: make tcm symmetric:
tcm = tcm + Matrix::t(Matrix::triu(tcm))
n_topics = 10
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(tcm)
lsa_2 = LatentSemanticAnalysis$new(n_topics)
d2 = lsa_2$fit_transform(tcm)
# despite being trained on the same data, words have completely different vectors:
sim2(d1["film",,drop=F], d2["film",,drop=F])
# yields values like -0.993363 but sometimes 0.9888435 (should be 1)
mean(diag(sim2(d1, d2)))
# e.g. -0.2316826
hist(diag(sim2(d1, d2)), main="self-similarity between models")
# note: these numbers are different every time!
# But: within each model, results seem consistent and reasonable:
# top similar words for "film":
head(sort(sim2(d1, d1["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873934 0.9803280 0.9732380 0.9680488 0.9668800
# same in the second model:
head(sort(sim2(d2, d2["film",,drop=F])[,1],decreasing = T))
# film movie show piece territory bay
# 1.0000000 0.9873935 0.9803279 0.9732364 0.9680495 0.9668819
# transform works:
sim2(d2["film",,drop=F], transform(tcm["film",,drop=F], lsa_2 )) # yields 1
# LSA in quanteda doesn't have this problem, same data => same vectors
library(quanteda)
d1q = textmodel_lsa(as.dfm(tcm), 10)
d2q = textmodel_lsa(as.dfm(tcm), 10)
mean(diag(sim2(d1q$docs, d2q$docs))) # yields 1
# the top synonyms for "film" are also a bit different with quanteda's LSA
# film movie hunk show territory bay
# 1.0000000 0.9770574 0.9675766 0.9642915 0.9577723 0.9573138
What's the deal, is it a bug, is this intended behaviour for some reason, or am I having a massive misunderstanding? (I'm kind of hoping for the latter...). If it's intended, why would quanteda behave differently?
The issue is that your matrix seems ill-conditioned and hence you have numerical stability issues.
library(text2vec)
library(magrittr)
data("movie_review")
N = 1000
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it=itoken(tokens)
voc = create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5, doc_proportion_max =0.9)
vectorizer = vocab_vectorizer(voc)
tcm = create_tcm(it, vectorizer)
# condition number
kappa(tcm)
# Inf
Now if you will do truncated SVD (algorithm behind LSA) you will notice that singular vectors are very close to zero:
library(irlba)
truncated_svd = irlba(tcm, 10)
str(truncated_svd)
# $ d : num [1:10] 2139 1444 660 559 425 ...
# $ u : num [1:4387, 1:10] -1.44e-04 -1.62e-04 -7.77e-05 -8.44e-04 -8.99e-04 ...
# $ v : num [1:4387, 1:10] 6.98e-20 2.37e-20 4.09e-20 -4.73e-20 6.62e-20 ...
# $ iter : num 3
# $ mprod: num 50
Hence the sign of the embeddings is not stable and cosine angle between them is not stable as well.