I am using the text2vec package in R to train word embeddings (the GloVe model) as follows:
library(text2vec)
library(tm)
prep_fun = tolower
tok_fun = word_tokenizer
tokens = docs %>%   # docs: a collection of text documents
  prep_fun %>%
  tok_fun
it = itoken(tokens, progressbar = FALSE)
stopword <- tm::stopwords("SMART")
vocab = create_vocabulary(it, stopwords = stopword)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 6)
x_max <- min(50, max(10, ceiling(length(vocab$doc_count) / 100)))  # clamp x_max between 10 and 50 based on vocabulary size
glove_model <- GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = x_max, learning_rate = 0.1)
word_vectors <- glove_model$fit_transform(tcm, n_iter = 1000, convergence_tol = 0.001)
When I run this code I get the following output:
My questions are:
I appreciate your response.
Many thanks, Sam
There is a member of the GlobalVectors class called n_dump_every. You can set it to some number and the history of the word embeddings will be saved. It can then be retrieved with the get_history() function:
glove_model <- GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = 100, learning_rate = 0.1)
glove_model$n_dump_every = 10  # save a snapshot of the embeddings every 10 iterations
word_vectors <- glove_model$fit_transform(tcm, n_iter = 1000, convergence_tol = 0.001)
trace = glove_model$get_history()  # retrieve the saved snapshots
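The exact layout of trace is not shown here; a quick, assumption-free way to see what get_history() collected in your own session is:

str(trace)     # inspect the structure of whatever was saved
length(trace)  # how many dumps were collected during training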
Regarding the second question, word_vectors_size: for a Wikipedia-sized corpus, 300 is usually enough. For smaller datasets you may start with 20-50. You really need to experiment with this.