r, text, tokenize, corpus, quanteda

Detokenize a Quanteda tokens object


I have a quanteda tokens object that I created using the "window" option (see code below). I want to do this for a series of words in order to inform the creation of a custom dictionary. How can I "de-tokenize" (concatenate, recombine) each tokenized "window" back into a single string? Each string could be either an item in a list or a row in a data.frame. I just need to be able to read each instance of the word or phrase (in this case "future") in its context.

Is there a command or some code that would let me "de-tokenize" this?

library(quanteda)
library(dplyr)

# Example data
d <- c("Thank you Mr. Speaker.  Mr. Speaker I’m not sure how,   but to the department of PWTTS, regarding the question I’d asked previously about the  future of our water reservoir.  I wonder if that was looked at since I ask that question to  Ms. Thompson.  Thank you", "Thank you Mr. Speaker.  Now if that doctor would be  located in that community how is the logistics or air travel going to be, moving between  the communities in the future.  Thank you")

# Corpus (named `corp` to avoid masking base::c)
corp <- corpus(d)

# My tokens object, consisting of a 3-word window around each instance of "future".
ttt <- tokens(corp, remove_punct = TRUE, remove_numbers = FALSE) %>%
  tokens_keep(pattern = "future", window = 3)


Solution

  • For a list output:

    > lapply(ttt, paste, collapse = " ")
    $text1
    [1] "previously about the future of our water"
    
    
    $text2
    [1] "communities in the future Thank you"
    

    Or, for a named character vector, which can go straight into a data.frame column:

    > vapply(ttt, paste, character(1), collapse = " ")
                                         text1                                      text2 
    "previously about the future of our water"      "communities in the future Thank you"
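
To get the data.frame mentioned in the question, the character vector can be wrapped with the document names. The following is only a minimal sketch: `windows_to_df` is a hypothetical helper (not part of quanteda), and `windows` is a hand-written stand-in for `as.list(ttt)` built from the output shown above, so the snippet runs without quanteda. Passing `ttt` itself should work as well, assuming `as.list()` on a tokens object returns one character vector per document.

```r
# Hypothetical helper (not a quanteda function): collapse each token
# window into one string and return one row per document.
windows_to_df <- function(x) {
  x <- as.list(x)  # assumed to yield a named list of character vectors
  data.frame(
    doc_id  = names(x),
    # paste the tokens of each document back together, separated by spaces
    context = unname(vapply(x, paste, character(1), collapse = " ")),
    stringsAsFactors = FALSE
  )
}

# Stand-in for as.list(ttt), copied from the windows shown above
windows <- list(
  text1 = c("previously", "about", "the", "future", "of", "our", "water"),
  text2 = c("communities", "in", "the", "future", "Thank", "you")
)

df <- windows_to_df(windows)
print(df)
```

Note that this gives one row per document, not per match: if "future" occurs several times in one text, all of its windows end up in the same string. When one row per match is needed, quanteda's `kwic()` may be a better fit.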