I’ve spent a couple of days working on topic models in R and I’m wondering if I could do the following:
I would like R to build topics based on a predefined termlist with specific terms. I already worked with this list to identify ngrams (RWeka) in documents and count only those terms which occur in my termlist using the following code:
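library(tm)
library(RWeka)
terms <- read.delim("TermList.csv", header = FALSE, stringsAsFactors = FALSE)
# tokenizer that produces all 1- to 4-grams
biTok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 4))
tdm <- TermDocumentMatrix(data.corpus, control = list(tokenize = biTok))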
Now I would like to use this list again to search for topics in the documents based only on the terms in my termlist.
Example: Within the following sentence: "The arrangements results in higher team performance and better user satisfaction" I would like to have the compound terms "team performance" and "user satisfaction" within topics instead of handling "team", "performance", "user" and "satisfaction" as single terms and building topics over them. This is why I need to use my predefined list.
Is there a way to define such a condition in R?
Perhaps something like this?
tokenizing.phrases <- c("team performance", "user satisfaction") # plus your others you have identified
Then load this function:
phraseTokenizer <- function(x) {
  require(stringr)
  # extract the plain text from the tm TextDocument object as a single string
  x <- as.character(x)
  x <- str_trim(paste(x[!is.na(x)], collapse = " "))
  if (x == "") return(character(0))
  # ignore.case() is deprecated in current stringr; fixed(..., ignore_case = TRUE) does the same job
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))
  if (any(phrase.hits)) {
    # only split once on the first hit, so we don't have to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    # recurse on the text before and after the phrase
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    # no phrase left in this piece, so fall back to tm's ordinary word tokenizer
    out <- MC_tokenizer(x)
  }
  # get rid of any extraneous empty strings, which can happen if a phrase occurs just before a punctuation mark
  out[out != ""]
}
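To see what kind of token vector this split-and-recurse idea produces, here is a minimal base-R sketch of the same technique that runs without tm or stringr. It is only an illustration: `phrase_tokens` and `sentence` are names I made up, the fallback splits on non-alphanumeric characters instead of calling `MC_tokenizer`, and matching is plain substring matching, so a phrase would also be found inside a longer word.

```r
# Hypothetical phrase list; replace with your own term list.
phrases <- c("team performance", "user satisfaction")

# Split on the first listed phrase found in the text, then recurse on
# the text before and after it.
phrase_tokens <- function(x, phrases) {
  x <- trimws(x)
  if (is.na(x) || x == "") return(character(0))
  lower <- tolower(x)
  # start position of each phrase in the text (-1 when absent)
  pos <- vapply(phrases,
                function(p) as.integer(regexpr(tolower(p), lower, fixed = TRUE)),
                integer(1), USE.NAMES = FALSE)
  hits <- which(pos > 0)
  if (length(hits) == 0) {
    # no phrase left: split into single words
    out <- unlist(strsplit(x, "[^[:alnum:]]+"))
    return(out[out != ""])
  }
  h <- hits[1]
  start <- pos[h]
  end <- start + nchar(phrases[h]) - 1L
  c(phrase_tokens(substr(x, 1L, start - 1L), phrases),
    phrases[h],
    phrase_tokens(substr(x, end + 1L, nchar(x)), phrases))
}

sentence <- "The arrangements results in higher team performance and better user satisfaction"
tokens <- phrase_tokens(sentence, phrases)
print(tokens)
```

For the example sentence, "team performance" and "user satisfaction" each come back as one token, and the remaining words come back individually.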
Then create your term document matrix with the predefined tokenizing.phrases:
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
When you then run your topic model function, it should treat the phrases you have identified as single terms in the model (your tokenizing.phrases list will of course be longer than the two-item example here).
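If the topics should be built only from the listed phrases, one option is tm's `dictionary` control option, which keeps just the given terms in the matrix. Below is a rough end-to-end sketch under several assumptions: the topicmodels package is used for the model (the question does not say which one is in use), the documents and the phrase list are toy data, and `phrase_only_tokenizer` is a compact stand-in written for this example (the `phraseTokenizer` above could be passed instead). Note that `LDA()` expects a document-term matrix, so the sketch builds a `DocumentTermMatrix` directly; an existing term-document matrix can be transposed with `t(tdm)`. As far as I know, recent tm versions ignore custom tokenizers on a `SimpleCorpus`, which is why the sketch uses `VCorpus`.

```r
library(tm)
library(topicmodels)

# Hypothetical toy documents and phrase list, for illustration only.
docs <- c(
  "The arrangements results in higher team performance and better user satisfaction",
  "User satisfaction depends on response time",
  "Team performance improves with clear goals and response time targets",
  "Better response time raises user satisfaction and team performance"
)
phrases <- c("team performance", "user satisfaction", "response time")

# Compact stand-in tokenizer: emits each listed phrase once per occurrence
# and drops every other word.
phrase_only_tokenizer <- function(x) {
  txt <- tolower(paste(as.character(x), collapse = " "))
  unlist(lapply(phrases, function(p) {
    n <- lengths(regmatches(txt, gregexpr(p, txt, fixed = TRUE)))
    rep(p, n)
  }))
}

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tokenize = phrase_only_tokenizer, dictionary = phrases)
)

# LDA() rejects documents without any terms, so drop empty rows first.
dtm <- dtm[slam::row_sums(dtm) > 0, ]

# k = 2 is an arbitrary choice for this toy data.
lda <- LDA(dtm, k = 2, control = list(seed = 1))
terms(lda, 3)
```

With a real corpus you would pick `k` to suit your data, and documents that contain none of the listed phrases drop out before the model is fitted.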