I have a Corpus (tm package) containing a collection of 1,300 different text documents [Content: documents: 1.300].
My goal is to count how often the words from a specific word list occur in each of those documents. For example, if my word list contains the words "january", "february", "march", ..., I want to analyze how often each document refers to these words.
Example:
Text 1: I like going on holiday in january and not in february.
Text 2: I went on a holiday in march.
Text 3: I like going on vacation.
The result should look like this:
Text 1: 2
Text 2: 1
Text 3: 0
I tried using the following code:
library(quanteda)
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks)
dict1 <- dictionary(list(c("january", "february", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
tail(dict_dtm2)
This code was proposed in a different thread; however, it does not work for me: an error occurs saying that it is only applicable to text or corpus elements.
How can I search for my word list using my existing Corpus from the tm package in R?
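To make your quanteda code work, you first have to convert your tm VCorpus object x to a quanteda corpus and fix a few other minor issues:
dictionary() expects a named list
the dictionary entries have to match the stemmed tokens ("january" and "february" are stemmed to "januari" and "februari")
library(tm)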
library(quanteda)
## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
"I went on a holiday in march.",
"I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"
### tm VCorpus object to Quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus" "character"
### continue with tokenization and stemming
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks)
# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
dict_dtm2
#> Document-feature matrix of: 3 documents, 2 features (16.67% sparse) and 7 docvars.
#>        features
#> docs    months _unmatched
#>   text1      2         10
#>   text2      1          7
#>   text3      0          6
Created on 2023-09-02 with reprex v2.0.2
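If you only need the per-document counts in the shape shown in the question, the lookup result can be converted to a plain data frame. The following is a minimal sketch rather than the only way to do it. It assumes the same three example documents, skips stemming so that the unmodified word list matches, and uses the illustrative names `wordlist`, `hits` and `counts`, which are not part of the code above.

```r
library(tm)
library(quanteda)

# Same example documents as above, built as a tm VCorpus
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- corpus(VCorpus(VectorSource(docs)))   # tm VCorpus -> quanteda corpus

# Illustrative word list (assumption: extend it with the remaining months)
wordlist <- c("january", "february", "march")

# No stemming in this variant, so the plain words can be used as patterns
dtm  <- dfm(tokens(x, remove_punct = TRUE))
dict <- dictionary(list(months = wordlist))
hits <- dfm_lookup(dtm, dict)

# One row per document: doc_id plus the number of word-list matches
counts <- convert(hits, to = "data.frame")
counts
```

Documents without any match keep their row with a count of 0, so `counts` always has one row per document in the corpus.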