I am analyzing parsed and segmented foreign-language (Simplified Chinese) text documents with R's stm package in order to use the package's plotting environment. I did not use the package's built-in text processing function because it currently does not support Chinese text. However, after I successfully prepared the data (which requires documents and vocab in lda format, combined with the original metadata with the same number of rows) and fitted the model, the plot() function threw an error, plausibly owing to an encoding issue at the preprocessing stage:
Error in nchar(text) : invalid multibyte string, element 1
Following suggestions from some previous threads, I applied the encoding functions from stringi and utf8 to convert vocab to UTF-8 and plotted the estimation result again, but it returned the same error. I'm wondering what is going on with the encoding and whether the error is fixable, since stm uses base R's plotting functions, which should have no problem displaying foreign-language text. (Note that I set the locale to "Chinese", i.e. Chinese (Simplified)_China.936, before preprocessing the raw text.)
I would really appreciate it if someone could enlighten me on this. My code is provided below.
Sys.setlocale("LC_ALL","Chinese") # set locale to simplified Chinese to render the text file
# install.packages("stm")
library(stm)
con1 <- url("https://www.dropbox.com/s/tldmo7v9ssuccxn/sample_dat.RData?dl=1")
load(con1)
names(sample_dat) # sample_dat is the original metadata and is reduced to only 3 columns
con2 <- url("https://www.dropbox.com/s/za2aeg0szt7nssd/blog_lda.RData?dl=1")
load(con2)
names(blog_lda) # blog_lda is an lda-type object consisting of documents and vocab
# using the script from stm vignette to prepare the data
out <- prepDocuments(blog_lda$documents, blog_lda$vocab, sample_dat)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
# estimate a 10-topic model for the ease of exposition
PrevFit <- stm(documents = docs, vocab = vocab, K = 10, prevalence =~ sentiment + s(day), max.em.its = 100, data = meta, init.type = "Spectral")
# model converged at the 65th iteration
# plot the model
par(mar=c(1,1,1,1))
plot(PrevFit, type = "summary", xlim = c(0, 1))
Error in nchar(text) : invalid multibyte string, element 1
# check vocab
head(vocab)
# returns garbled text
[1] "\"�\xf3½\"," "\"���\xfa\xe8�\","
[3] "\"�\xe1\"," "\"\xc8\xcb\","
[5] "\"\u02f5\"," "\"��\xca\xc7\","
Please use
vocab <- iconv(out$vocab)
or
vocab <- iconv(out$vocab, to = "UTF-8")
instead of vocab <- out$vocab.
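To illustrate what such a conversion does, here is a minimal, self-contained sketch. It does not use the data from the question: the string is built by hand from the GBK bytes of a two-character Chinese word, and the names gbk_word and utf8_word are made up for the example. It also assumes that the iconv build on your platform recognises the encoding name "GBK" (code page 936 on Windows), which may not hold everywhere.

```r
# Hypothetical example: the GBK (code page 936) bytes for the word "中文".
gbk_word <- rawToChar(as.raw(c(0xd6, 0xd0, 0xce, 0xc4)))

# These bytes are not valid UTF-8; input like this can trigger the
# "invalid multibyte string" error in functions such as nchar().
validUTF8(gbk_word)

# Convert with an explicit source encoding instead of relying on the locale.
# Assumption: "GBK" is a name known to the local iconv implementation.
utf8_word <- iconv(gbk_word, from = "GBK", to = "UTF-8")

validUTF8(utf8_word)              # check that the result is valid UTF-8
Encoding(utf8_word)               # declared encoding of the converted string
nchar(utf8_word, type = "chars")  # character count of the converted string
```

With the data from the question, the same idea would read vocab <- iconv(out$vocab, from = "GBK", to = "UTF-8"), assuming the vocabulary really is stored in GBK. If validUTF8(vocab) still reports FALSE for some entries afterwards, the text was probably damaged before this step and may need to be re-read with the correct encoding.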