r quanteda

Get around deprecated quanteda texts() function


I am trying to replicate this paper

The tokens.R script cleans up the corpus with the following command:

texts(corp) <- stri_replace_all_regex(texts(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")

This yields the following error message:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [387896] must be the same length as the vector [4]
In addition: Warning message:
'texts.corpus' is deprecated.
Use 'as.character' instead.
See help("Deprecated") 

So I naively applied the as.character function like this:

as.character(corp) <- stri_replace_all_regex(as.character(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")

This yields the same error:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [387896] must be the same length as the vector [4]

I tried some other things, like addressing only $documents within the corpus or turning the corpus into a vector, but none of that worked.

How can I get around this?

Thank you in advance.


Solution

  • The corpus loaded in the linked tokens.R script (from data/corpus_nytimes_summary.RDS) is stored in a very old corpus object format.

    You can convert this into a new format corpus using:

    corp <- corpus(corp)
    

    Then replace the texts using this approach:

    corp[] <- stri_replace_all_regex(corp, "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")
    

    The use of corp[] replaces the character part of corp without stripping the additional attributes (metadata and docvars) that make the character object corp a quanteda corpus.
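
To check that the replacement keeps the corpus intact, here is a minimal sketch on a toy corpus. The two texts, the document names d1 and d2, and the year docvar are made up for illustration and do not come from the paper's data; the sketch also passes as.character(corp) to stringi explicitly, which should be equivalent to passing corp itself.

```r
library(quanteda)
library(stringi)

# Hypothetical stand-in for the NYT summaries: one text with a dateline, one without
txt <- c(d1 = "WASHINGTON (AP) -- Congress passed the bill.",
         d2 = "No dateline here.")
corp <- corpus(txt, docvars = data.frame(year = c(2001, 2002)))

# Same regex as in tokens.R: strips an upper-case dateline up to and including "--"
pattern <- "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)"

# corp[] overwrites only the texts; class, docnames and docvars stay attached
corp[] <- stri_replace_all_regex(as.character(corp), pattern, "")

is.corpus(corp)      # should still be a corpus
docvars(corp)        # the year column should be unchanged
as.character(corp)   # d1 should have lost its dateline, d2 should be untouched
```

By contrast, assigning with corp <- stri_replace_all_regex(...) would leave a plain character vector and discard the docvars.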