I want to preprocess my text data using the {quanteda} package in R. To do so, I create a corpus, which is then tokenized and preprocessed (e.g. lowercased, punctuation removed, etc.).
Ideally, I would then want to restore the initial sentence structure of the corpus, whilst keeping the document variables, because I am following a string-of-words approach in the analysis.
library(quanteda)
library(magrittr)  # provides the %>% pipe used below
# Create an example corpus.
my_corpus <- corpus(c("This is a sentence. \n\nThis is another sentence.",
                      "This is the first sentence of the second document.",
                      "This is yet another ... ••• *** sentence."))
# Set docvars.
docvars(my_corpus) <- data.frame(doc_id = 1:3, author = c("A", "B", "C"))
# Three documents and four sentences.
ndoc(my_corpus)
nsentence(my_corpus)
# Tokenize and preprocess.
my_tokens <- my_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower()
my_tokens
# Docvars are still present.
docvars(my_tokens)
I could then simply do the following to restore the sentence structure. However, in the process of doing so, I would lose my docvars:
# Back-transform to sentences.
my_corpus.clean <- vapply(my_tokens, paste, collapse = " ", character(1)) %>% corpus()
# Docvars are lost.
docvars(my_corpus.clean)
The preprocessing worked and so did restoring the sentence structure, but I no longer have my docvars. I could add them back to the new corpus object (docvars(...) <- ...), but I am afraid that the docvar values will no longer correspond to the right documents.
Is there a way to transform the tokens object back to a sentence-based object that avoids losing the docvars?
Pass the docvars from the tokens object to corpus() when back-transforming:
# back-transform to sentences.
my_corpus.clean <- vapply(my_tokens, paste, collapse = " ", character(1)) |>
  corpus(docvars = docvars(my_tokens))
# docvars are present
my_corpus.clean
#> Corpus consisting of 3 documents and 2 docvars.
#> text1 :
#> "this is a sentence this is another sentence"
#>
#> text2 :
#> "this is the first sentence of the second document"
#>
#> text3 :
#> "this is yet another sentence"
docvars(my_corpus.clean)
#>   doc_id author
#> 1      1      A
#> 2      2      B
#> 3      3      C
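
If you are worried about docvars ending up on the wrong documents, you can check the alignment explicitly. The sketch below rebuilds the example from the question and assumes that vapply() keeps the document names as names of the resulting character vector, which corpus() then uses as docnames. The intermediate name my_texts is only introduced here for illustration.

```r
library(quanteda)

# Rebuild the example from the question.
my_corpus <- corpus(c("This is a sentence. \n\nThis is another sentence.",
                      "This is the first sentence of the second document.",
                      "This is yet another ... ••• *** sentence."))
docvars(my_corpus) <- data.frame(doc_id = 1:3, author = c("A", "B", "C"))

# Tokenize and preprocess without a pipe, so no extra package is needed.
my_tokens <- tokens_tolower(tokens(my_corpus, remove_punct = TRUE))

# One string per document; my_texts is a hypothetical helper name.
my_texts <- vapply(my_tokens, paste, collapse = " ", FUN.VALUE = character(1))

# Rebuild the corpus and hand over the docvars of the tokens object.
my_corpus.clean <- corpus(my_texts, docvars = docvars(my_tokens))

# Sanity checks: same documents in the same order, same docvar values.
stopifnot(
  identical(docnames(my_corpus.clean), docnames(my_tokens)),
  identical(docvars(my_corpus.clean, "author"), docvars(my_tokens, "author"))
)
```

If either check fails, stopifnot() raises an error, so a silent mismatch between texts and docvars would not go unnoticed.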