Tags: r, keyword, tm, udpipe

Make udpipe_annotate() faster


I am currently working on a text-mining project in which I want to extract relevant keywords from my text (note that I have many, many text documents).

I am using the udpipe package. A great vignette is available at http://bnosac.be/index.php/blog/77-an-overview-of-keyword-extraction-techniques. Everything works, but when I run the code, the part

x <- udpipe_annotate(ud_model, x = comments$feedback)

is really, really slow (especially when you have a lot of text). Does anyone have an idea how to make this part faster? A workaround is of course fine.

library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes

data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback) # This part is really, really slow 
x <- as.data.frame(x)

Many thanks in advance!


Solution

  • I'm adding an answer based on the future API. This works independently of which OS (Windows, macOS, or Linux flavour) you are using.

    The future.apply package provides parallel alternatives for the base *apply family. The rest of the code is based on the answer from @jwijffels; the only difference is that I use data.table in the annotate_splits function.

    library(udpipe)
    library(data.table)
    
    data(brussels_reviews)
    comments <- subset(brussels_reviews, language %in% "es")
    ud_model <- udpipe_download_model(language = "spanish", overwrite = FALSE)
    ud_es <- udpipe_load_model(ud_model$file_model)
    
    
    # annotate one chunk and return a data.table; the model is loaded inside
    # the function because a loaded udpipe model holds an external pointer
    # that cannot be exported to parallel workers
    annotate_splits <- function(x, file) {
      ud_model <- udpipe_load_model(file)
      x <- as.data.table(udpipe_annotate(ud_model, 
                                         x = x$feedback,
                                         doc_id = x$id))
      return(x)
    }
    
    
    # load parallel library future.apply
    library(future.apply)
    
    # Define cores to be used; multisession works on all operating systems
    # (plan(multiprocess) is deprecated in recent versions of future)
    ncores <- 3L
    plan(multisession, workers = ncores)
    
    # split the comments into chunks of roughly 100 documents each
    corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))
    
    annotation <- future_lapply(corpus_splitted, annotate_splits, file = ud_model$file_model)
    annotation <- rbindlist(annotation)
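
A note on the split() step above: the grouping vector produced by seq() is recycled over the rows, so documents are assigned to chunks round-robin rather than in contiguous blocks; the doc_id passed to udpipe_annotate() keeps each annotation tied to its document regardless. A minimal, self-contained illustration with dummy data (no udpipe needed; the toy data frame is hypothetical):

```r
# Dummy data frame standing in for `comments` (hypothetical toy data)
comments <- data.frame(id = 1:300, feedback = rep("texto", 300))

# seq(1, 300, by = 100) gives c(1, 101, 201); split() recycles it over the
# 300 rows, yielding 3 chunks of 100 rows each, assigned round-robin
chunks <- split(comments, seq(1, nrow(comments), by = 100))

length(chunks)        # number of chunks
sapply(chunks, nrow)  # rows per chunk
```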
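
Independently of udpipe, the future.apply pattern used here is simply: set a plan, call the future_* counterpart of the base function, and reset the plan when done. A toy sketch (assuming future.apply is installed):

```r
library(future.apply)

# cross-platform parallel backend (sequential works too, for debugging)
plan(multisession, workers = 2)
squares <- future_lapply(1:4, function(i) i^2)
plan(sequential)  # shut the workers down again

unlist(squares)
```

After the real annotation run above, calling plan(sequential) in the same way releases the worker processes.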