r word-cloud quanteda

How do I subset the SOTU corpus to Presidents Bush and Carter with sotu and quanteda to generate a word cloud?


I'm working with the State of the Union (SOTU) corpus in quanteda and need to subset it to President Bush's and President Carter's speeches.

I've been learning how to preprocess the corpus in dfm format, but I'm not sure how to fix the error below. This is my current code:

library("quanteda")
library("dplyr")
library("sotu")

textplot_scale1d(wf_sotu)

sotu_meta %>%
  filter(!duplicated(president, fromLast = TRUE)) %>% tail()

sotu <- sotu_meta %>%
  bind_cols(text = sotu_text) %>%
  mutate(docnames = paste(president, year, sep = ": "))
sotu

sotu_dfm <- sotu %>%
  corpus(
    docid_field = "docnames",
    text_field = "text"
  ) %>%
  dfm_select(pattern = dict, 
        valuetype = "regex")%>%
  dfm_remove(stopwords())

I get the following error message:

Error in corpus.character(x[[text_index]], docvars = docvars, docnames = docname,  : 
  docnames must be unique

Solution

  • Eight values in the "docnames" column occur twice, so the document names are not unique.

    library(dplyr)
    
    count(sotu, docnames) %>%
      filter(n>1)
    
                         docnames n
    1  Dwight D. Eisenhower: 1956 2
    2 Franklin D. Roosevelt: 1945 2
    3     George Washington: 1790 2
    4          Jimmy Carter: 1978 2
    5          Jimmy Carter: 1979 2
    6          Jimmy Carter: 1980 2
    7      Richard M. Nixon: 1972 2
    8      Richard M. Nixon: 1974 2
    

    If you omit those rows, corpus() runs without error (the .by argument requires dplyr 1.1.0 or later).

    sotu %>%
      filter(n()==1, .by=docnames) %>%
      corpus(docid_field = "docnames", text_field = "text")
    

    Corpus consisting of 224 documents and 6 docvars.
    George Washington: 1791 :
    "  Fellow-Citizens of the Senate and House of Representatives..."
    
    George Washington: 1792 :
    "Fellow-Citizens of the Senate and House of Representatives: ..."
    
    George Washington: 1793 :
    "  Fellow-Citizens of the Senate and House of Representatives..."
    
    George Washington: 1794 :
    "  Fellow-Citizens of the Senate and House of Representatives..."
    
    George Washington: 1795 :
    " Fellow-Citizens of the Senate and House of Representatives:..."
    
    George Washington: 1796 :
    "  Fellow-Citizens of the Senate and House of Representatives..."
    
    [ reached max_ndoc ... 218 more documents ]
    

    You could also relax the uniqueness requirement, which keeps every document.

    sotu %>% 
      corpus(docid_field = "docnames", text_field = "text",
             unique_docnames = FALSE)
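
Note that filtering on n() == 1 removes every copy of a duplicated name, so Carter's 1978, 1979 and 1980 addresses are dropped entirely, which matters if Carter is one of the presidents you want to plot. Another possibility is to make the names unique before building the corpus. The following is a minimal sketch using base R's make.unique() on a toy data frame; the rows are invented stand-ins for the `sotu` data frame, not real speeches, and the ".1" style suffix is simply what make.unique() produces.

```r
# Toy stand-in for the `sotu` data frame from the question (hypothetical rows).
toy <- data.frame(
  president = c("Jimmy Carter", "Jimmy Carter", "Jimmy Carter", "Richard M. Nixon"),
  year      = c(1978, 1978, 1979, 1972),
  text      = c("spoken address", "written message", "spoken address", "spoken address"),
  stringsAsFactors = FALSE
)

# Same naming scheme as in the question: "<president>: <year>".
base_names <- paste(toy$president, toy$year, sep = ": ")

# make.unique() leaves the first occurrence alone and appends ".1", ".2", ...
# to later repeats, so no row has to be dropped.
toy$docnames <- make.unique(base_names)

print(toy$docnames)
```

The same make.unique() call could replace the paste() inside mutate() in the question, after which corpus() should accept the "docnames" column with its default uniqueness check.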
    

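Once the corpus builds, the subsetting and word cloud steps from the question could look roughly like the sketch below. It makes several assumptions: quanteda 3.0 or later, where textplot_wordcloud() lives in the separate quanteda.textplots package; president names in sotu_meta that contain "Bush" or "Carter" (the pattern matches both Presidents Bush, so check unique(sotu_meta$president) and tighten it if you only want one); and a plain English stopword list in place of the undefined `dict` from the question. Note that dfm_select() and dfm_remove() operate on a dfm, not on a corpus, so the text is tokenised and converted first.

```r
library(quanteda)
library(quanteda.textplots)  # assumption: quanteda >= 3.0 keeps plotting functions here
library(sotu)

# Same data frame as in the question, built with base R.
sotu_df <- data.frame(sotu_meta, text = sotu_text, stringsAsFactors = FALSE)
sotu_df$docnames <- paste(sotu_df$president, sotu_df$year, sep = ": ")

# Keep all speeches by relaxing the uniqueness check, as shown above.
sotu_corp <- corpus(sotu_df, docid_field = "docnames", text_field = "text",
                    unique_docnames = FALSE)

# Keep only documents whose president docvar contains "Bush" or "Carter".
bc_corp <- corpus_subset(sotu_corp, grepl("Bush|Carter", president))

# Tokenise, drop stopwords, then build the document-feature matrix.
bc_toks <- tokens(bc_corp, remove_punct = TRUE, remove_numbers = TRUE)
bc_toks <- tokens_remove(bc_toks, stopwords("en"))
bc_dfm  <- dfm(bc_toks)

# Collapse to one row per president and draw a comparison cloud.
bc_by_pres <- dfm_group(bc_dfm, groups = docvars(bc_dfm, "president"))
textplot_wordcloud(bc_by_pres, comparison = TRUE, max_words = 100)
```

Grouping by president before plotting gives one colour per president in the comparison cloud; for a single combined cloud, drop the dfm_group() step and call textplot_wordcloud(bc_dfm) without comparison = TRUE.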