I am currently working on a text mining project where I want to extract relevant keywords from my text (note that I have a very large number of text documents).
I am using the udpipe package. A great vignette is available online at http://bnosac.be/index.php/blog/77-an-overview-of-keyword-extraction-techniques. Everything works, but when I run the code, the part
x <- udpipe_annotate(ud_model, x = comments$feedback)
is really, really slow (especially when you have a lot of text). Does anyone have an idea how I can make this part faster? A workaround is of course fine.
library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback) # This part is really, really slow
x <- as.data.frame(x)
Many thanks in advance!
I'm adding an answer based on the future package. It works regardless of which OS (Windows, macOS, or a Linux flavour) you are using.
The future.apply package provides parallel alternatives to the base *apply family. The rest of the code is based on the answer from @jwijffels. The only difference is that I use data.table in the annotate_splits function.
library(udpipe)
library(data.table)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish", overwrite = FALSE)
ud_es <- udpipe_load_model(ud_model)
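# Annotate one chunk of comments; returns a data.table.
# The model is loaded inside the function because a loaded udpipe model
# is an external pointer that cannot be passed to another R process.
annotate_splits <- function(x, file) {
  ud_model <- udpipe_load_model(file)
  x <- as.data.table(udpipe_annotate(ud_model,
                                     x = x$feedback,
                                     doc_id = x$id))
  return(x)
}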
# load parallel library future.apply
library(future.apply)
# Define cores to be used
ncores <- 3L
# multiprocess is deprecated in recent versions of future; multisession works on all OSes
plan(multisession, workers = ncores)
# split comments into chunks of 100 consecutive reviews, to be spread over the workers
corpus_splitted <- split(comments, ceiling(seq_len(nrow(comments)) / 100))
annotation <- future_lapply(corpus_splitted, annotate_splits, file = ud_model$file_model)
annotation <- rbindlist(annotation)
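If you want to check the split-apply-combine pattern without downloading a model, here is a minimal sketch in base R. It assumes a made-up toy data frame, and the names split_chunks and count_chars are hypothetical stand-ins (count_chars takes the place of annotate_splits and only counts characters). It is meant to show the shape of the approach, not to replace the code above.

```r
# Toy stand-in for `comments`: 250 made-up rows with an id and a feedback column
toy <- data.frame(id = seq_len(250),
                  feedback = paste("review", seq_len(250)),
                  stringsAsFactors = FALSE)

# Hypothetical helper: split a data.frame into consecutive chunks of at most `size` rows
split_chunks <- function(df, size = 100L) {
  split(df, ceiling(seq_len(nrow(df)) / size))
}

chunks <- split_chunks(toy, size = 100L)
chunk_sizes <- vapply(chunks, nrow, integer(1))

# Hypothetical stand-in for annotate_splits(): counts characters instead of running udpipe
count_chars <- function(x) {
  data.frame(doc_id = x$id, n_char = nchar(x$feedback))
}

# Plain lapply() here; once a plan() is set, future.apply::future_lapply()
# can be called with the same arguments to run the chunks in parallel
result <- do.call(rbind, lapply(chunks, count_chars))
```

Smaller chunks balance the load across workers more evenly, but every chunk reloads the model inside the worker function, so very small chunks spend most of their time on loading rather than annotating.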