I have a corpus of .txt documents. I do not need every sentence in these documents; I only want to keep the sentences that contain specific key words. I will then run similarity measures and other analyses on the result.
So, here is an example. From the data_corpus_inaugural data set of the quanteda package, I only want to keep the sentences in my corpus that contain the words "future" and/or "children".
I load my packages and create the corpus:
library(quanteda)
library(stringr)
## corpus with data_corpus_inaugural of the quanteda package
corpus <- corpus(data_corpus_inaugural)
summary(corpus)
Then I want to keep only those sentences that contain my key words:
## keep only those sentences of a document that contain the words future and/or children
First, let's see which documents contain these key words:
## extract all matches of future or children
str_extract_all(corpus, pattern = "future|children")
So far, I only found out how to exclude the sentences that contain my key words, which is the opposite of what I want to do.
## excluded sentences that contains future or children or both (?)
corpustrim <- corpus_trimsentences(corpus, exclude_pattern =
"future|children")
summary(corpustrim)
The above command excludes sentences containing my key words. What I want from corpus_trimsentences is the reverse: to exclude all sentences except those containing "future" and/or "children".
I tried using a regular expression for this, but it does not return what I want.
I looked into the corpus_reshape and corpus_subset functions of the quanteda package, but I can't figure out how to use them for my purpose.
You are correct that it's corpus_reshape() and corpus_subset() that you want here. Here's how to use them.
First, reshape the corpus to sentences.
library("quanteda")
data_corpus_inauguralsents <-
corpus_reshape(data_corpus_inaugural, to = "sentences")
data_corpus_inauguralsents
Then use stringr to create a logical (Boolean) vector, equal in length to the new sentence corpus, that indicates the presence or absence of the pattern in each sentence.
containstarget <-
stringr::str_detect(texts(data_corpus_inauguralsents), "future|children")
summary(containstarget)
## Mode FALSE TRUE
## logical 4879 137
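One thing to keep in mind: the pattern "future|children" is case-sensitive and matches anywhere in the string, so it would also hit "futures" and miss "Children". If that matters, a stricter pattern can be used. Here is a minimal base R sketch on made-up sentences (not taken from the inaugural corpus) comparing the two:

```r
# Toy sentences, invented purely for illustration
sents <- c(
  "Our Children deserve better.",
  "We look to the future.",
  "The futures market was volatile.",
  "Nothing relevant here."
)

# Plain alternation: case-sensitive, matches substrings
loose <- grepl("future|children", sents)

# Whole words only (\\b = word boundary), ignoring case
strict <- grepl("\\b(future|children)\\b", sents,
                ignore.case = TRUE, perl = TRUE)

print(loose)   # FALSE  TRUE  TRUE FALSE
print(strict)  #  TRUE  TRUE FALSE FALSE
```

Either logical vector can be used in the subsetting step in the same way as containstarget.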
Then use corpus_subset() to keep only those with the pattern:
data_corpus_inauguralsentssub <-
corpus_subset(data_corpus_inauguralsents, containstarget)
tail(texts(data_corpus_inauguralsentssub), 2)
## 2017-Trump.30
## "But for too many of our citizens, a different reality exists: mothers and children trapped in poverty in our inner cities; rusted-out factories scattered like tombstones across the landscape of our nation; an education system, flush with cash, but which leaves our young and beautiful students deprived of all knowledge; and the crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential."
## 2017-Trump.41
## "And now we are looking only to the future."
Finally, if you want to put these selected sentences back into their original document containers, but without the sentences that did not contain the target words, then reshape again:
# reshape back to documents that contain only sentences with the target terms
corpus_reshape(data_corpus_inauguralsentssub, to = "documents")
## Corpus consisting of 49 documents and 3 docvars.
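If you need this more than once, the three steps can be wrapped in a small function. This is only a sketch under a few assumptions: keep_sentences is a hypothetical helper name (it is not part of quanteda), it uses as.character() on the corpus where the code above uses texts() (my understanding is that newer quanteda versions prefer this), and it subsets with logical indexing instead of corpus_subset() to avoid name clashes with docvars inside a function.

```r
library(quanteda)

# Hypothetical helper, not a quanteda function:
# keep only the sentences of each document that match `pattern`
keep_sentences <- function(x, pattern) {
  # 1. split every document into one "document" per sentence
  sents <- corpus_reshape(x, to = "sentences")
  # 2. logical vector: does each sentence match the pattern?
  #    (as.character() is assumed here in place of texts())
  hits <- grepl(pattern, as.character(sents))
  # 3. keep matching sentences and regroup them by original document
  corpus_reshape(sents[hits], to = "documents")
}

# Toy corpus, invented for illustration
toy <- corpus(c(doc1 = "We look to the future. Nothing else matters.",
                doc2 = "No match in this one."))
result <- keep_sentences(toy, "future|children")
print(ndoc(result))
print(as.character(result))
```

Note that a document with no matching sentence drops out of the result entirely, which is presumably why the reshaped corpus above has only 49 documents.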