r nlp tm corpus

Searching for specific words in Corpus with R (tm package)


I have a Corpus (tm package) containing a collection of 1,300 different text documents [Content: documents: 1,300].

My goal is to count how often the words from a specific word list occur in each of those documents. For example, if my word list contains the words "january", "february", "march", ..., I want to know how often each document refers to these words.

Example: 
Text 1: I like going on holiday in january and not in february.
Text 2: I went on a holiday in march.
Text 3: I like going on vacation.

The result should look like this:

Text 1: 2 
Text 2: 1
Text 3: 0

I tried the following code:

library(quanteda)
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 

dtm <- dfm(toks)

dict1 <- dictionary(list(c("january", "february", "march")))

dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
tail(dict_dtm2)  

This code was suggested in another thread, but it does not work on my data: it fails with an error saying that it is only applicable to text or corpus objects.

How can I search for my word list in my existing tm Corpus in R?


Solution

  • To make your quanteda code work, first convert your tm VCorpus object x to a quanteda corpus, then fix a few other minor issues:

    library(tm)
    library(quanteda)
    
    ## prepare reprex, create tm VCorpus:
    docs <- c("I like going on holiday in january and not in february.",
              "I went on a holiday in march.",
              "I like going on vacation.")
    x <- VCorpus(VectorSource(docs))
    class(x)
    #> [1] "VCorpus" "Corpus"
    
    ### tm VCorpus object to Quanteda corpus:
    x <- corpus(x)
    class(x)
    #> [1] "corpus"    "character"
    
    ### continue with tokenization and stemming
    toks <- tokens(x) 
    toks <- tokens_wordstem(toks) 
    dtm <- dfm(toks)
    
    # dictionary() requires a named list, e.g. list(months = c(...));
    # stemming turns "january" and "february" into "januari" and "februari",
    # so the dictionary below uses the glob patterns "januar*" and "februar*"
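    # (added check, not part of the original reprex) char_wordstem() applies
    # the stemmer to a plain character vector, so you can inspect the stems
    # that the dictionary patterns have to match:
    quanteda::char_wordstem(c("january", "february", "march"))
    #> [1] "januari"  "februari" "march"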
    dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
    dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
    dict_dtm2
    #> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 7 docvars.
    #>        features
    #> docs    months _unmatched
    #>   text1      2         10
    #>   text2      1          7
    #>   text3      0          6
    

    Created on 2023-09-02 with reprex v2.0.2
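
If you would rather stay within tm, the following is a minimal sketch of one possible alternative, not part of the reprex above. It assumes that exact, unstemmed matches are enough, and it relies on the `dictionary` option of `DocumentTermMatrix()`, which restricts the matrix to the listed terms. The names `wordlist`, `x_clean` and `counts` are made up for illustration.

```r
library(tm)

# same example documents as in the question
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))

# hypothetical name for the word list from the question
wordlist <- c("january", "february", "march")

# normalise the text so that "february." matches "february"
x_clean <- tm_map(x, content_transformer(tolower))
x_clean <- tm_map(x_clean, removePunctuation)

# keep only the word-list terms as columns of the document-term matrix
dtm <- DocumentTermMatrix(x_clean, control = list(dictionary = wordlist))

# sum over the terms to get one total per document
counts <- rowSums(as.matrix(dtm))
counts
```

For the three example texts this should give totals of 2, 1 and 0, matching the result asked for in the question. Note that tm drops terms shorter than three characters by default (the `wordLengths` control option), which matters if your word list contains very short words.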