r, encoding, topic-modeling, chinese-locale

invalid multibyte string in foreign language encoding


I am analyzing parsed/segmented simplified Chinese text documents with R's stm package in order to use its plotting environment. I did not use the package's built-in text-processing function, as it does not currently support Chinese text. However, after I prepared the data myself (stm requires documents and vocab in lda format, combined with the original metadata of the same row length) and fitted the model, the plot() function threw an error, plausibly owing to some encoding issue introduced at the preprocessing stage:

Error in nchar(text) : invalid multibyte string, element 1

Following suggestions from previous threads, I applied the encoding functions from stringi and utf8 to convert the vocab to UTF-8 and re-plotted the estimation result, but the same error was returned. I'm wondering what is going on with the encoding and whether this error is fixable, since stm uses base R's plotting functions and those should have no problem displaying foreign-language text. (Note that I set the language locale to "Chinese" ((Simplified)_China.936) before preprocessing the raw text.)
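The re-encoding attempts, applied to the vocab produced by prepDocuments(), looked roughly like this (a sketch; stri_enc_toutf8() and as_utf8() stand in for the exact calls I used):

# attempted re-encoding of the vocabulary (two alternatives tried)
library(stringi)
library(utf8)
vocab <- stri_enc_toutf8(vocab)   # stringi attempt
# vocab <- as_utf8(vocab)         # utf8 attempt (tried separately)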

I would really appreciate it if someone could enlighten me on this. My code is provided below.

Sys.setlocale("LC_ALL","Chinese")  # set locale to simplified Chinese to render the text file
# install.packages("stm")
require(stm)

con1 <- url("https://www.dropbox.com/s/tldmo7v9ssuccxn/sample_dat.RData?dl=1")
load(con1)
names(sample_dat)  # sample_dat is the original metadata and is reduced to only 3 columns
con2 <- url("https://www.dropbox.com/s/za2aeg0szt7nssd/blog_lda.RData?dl=1")
load(con2)
names(blog_lda)   # blog_lda is an lda-type object consisting of documents and vocab

# using the script from stm vignette to prepare the data
out <- prepDocuments(blog_lda$documents, blog_lda$vocab, sample_dat)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

# estimate a 10-topic model for ease of exposition
PrevFit <- stm(documents = docs, vocab = vocab, K = 10,
               prevalence = ~ sentiment + s(day),
               max.em.its = 100, data = meta, init.type = "Spectral")
# the model converged at the 65th EM iteration
# plot the model
par(mar=c(1,1,1,1))
plot(PrevFit, type = "summary", xlim = c(0, 1))
Error in nchar(text) : invalid multibyte string, element 1

# check vocab
head(vocab)
# returns garbled text
[1] "\"�\xf3½\","       "\"���\xfa\xe8�\","
[3] "\"�\xe1\","        "\"\xc8\xcb\","    
[5] "\"\u02f5\","       "\"��\xca\xc7\","  
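Checking how the strings are marked may help pin this down; this is a sketch using base R's Encoding() and stringi's encoding detection (output omitted here):

# inspect the declared encoding and guess the actual one
Encoding(head(vocab))               # possibly "unknown" rather than "UTF-8"
stringi::stri_enc_detect(vocab[1])  # heuristic guess at the underlying encoding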


Solution

Use

    vocab <- iconv(out$vocab)

or

    vocab <- iconv(out$vocab, to = "UTF-8")

instead.
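For example, applying the conversion and re-plotting might look like this (a sketch; validUTF8() is base R, and the check is only a sanity test on this data):

    # re-encode the vocabulary to UTF-8 and verify the conversion
    vocab <- iconv(out$vocab, to = "UTF-8")
    head(vocab)             # should now display readable Chinese characters
    all(validUTF8(vocab))   # sanity check: TRUE if every entry is valid UTF-8

    # re-plot the fitted model
    par(mar = c(1, 1, 1, 1))
    plot(PrevFit, type = "summary", xlim = c(0, 1))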