I'm working with the State of the Union (SOTU) corpus in quanteda and need to subset it to the speeches of Presidents Bush and Carter.
I've been learning how to preprocess a corpus and convert it to a dfm, but I'm not sure how to fix the error below. This is the code I have right now:
library("quanteda")
library("dplyr")
library("sotu")
textplot_scale1d(wf_sotu)
sotu_meta %>%
  filter(!duplicated(president, fromLast = TRUE)) %>%
  tail()
sotu <- sotu_meta %>%
  bind_cols(text = sotu_text) %>%
  mutate(docnames = paste(president, year, sep = ": "))
sotu
sotu_dfm <- sotu %>%
  corpus(
    docid_field = "docnames",
    text_field = "text"
  ) %>%
  dfm_select(pattern = dict,
             valuetype = "regex") %>%
  dfm_remove(stopwords())
I get the following error message:
Error in corpus.character(x[[text_index]], docvars = docvars, docnames = docname, : docnames must be unique
Eight values in the "docnames" column occur twice, so the president-and-year label does not identify every speech uniquely:
library(dplyr)
count(sotu, docnames) %>%
  filter(n > 1)
                     docnames n
1  Dwight D. Eisenhower: 1956 2
2 Franklin D. Roosevelt: 1945 2
3     George Washington: 1790 2
4          Jimmy Carter: 1978 2
5          Jimmy Carter: 1979 2
6          Jimmy Carter: 1980 2
7      Richard M. Nixon: 1972 2
8      Richard M. Nixon: 1974 2
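To see how the rows behind a duplicated name differ, it can help to print them in full. Here is a minimal sketch in base R, using an invented three-row data frame (`toy`) in place of `sotu`; the rows and the `type` column are assumptions for illustration only:

```r
# Hypothetical stand-in for `sotu`; the rows are invented for illustration.
toy <- data.frame(
  docnames = c("Jimmy Carter: 1978", "Jimmy Carter: 1978", "Ronald Reagan: 1982"),
  type     = c("speech", "written", "speech"),
  stringsAsFactors = FALSE
)

# duplicated() flags only the second and later occurrences, so combine it
# with a scan from the other end to flag every member of a duplicate group.
is_dup <- duplicated(toy$docnames) | duplicated(toy$docnames, fromLast = TRUE)
dup_rows <- toy[is_dup, ]
dup_rows
```

Applied to the real `sotu` data frame, the same indexing would list both rows for each of the eight names, with all of their metadata columns.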
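If you omit those, the code runs error-free. Note that the filter below keeps only names that occur exactly once, so it drops both copies of each duplicated name (16 rows in total), including Carter's 1978-1980 speeches. The .by argument requires dplyr 1.1.0 or later.
sotu %>%
  filter(n() == 1, .by = docnames) %>%
  corpus(docid_field = "docnames", text_field = "text")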
Corpus consisting of 224 documents and 6 docvars.
George Washington: 1791 :
" Fellow-Citizens of the Senate and House of Representatives..."
George Washington: 1792 :
"Fellow-Citizens of the Senate and House of Representatives: ..."
George Washington: 1793 :
" Fellow-Citizens of the Senate and House of Representatives..."
George Washington: 1794 :
" Fellow-Citizens of the Senate and House of Representatives..."
George Washington: 1795 :
" Fellow-Citizens of the Senate and House of Representatives:..."
George Washington: 1796 :
" Fellow-Citizens of the Senate and House of Representatives..."
[ reached max_ndoc ... 218 more documents ]
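Alternatively, you could relax the uniqueness check with unique_docnames = FALSE, so that corpus() accepts the duplicated names instead of raising an error:
sotu %>%
  corpus(docid_field = "docnames", text_field = "text",
         unique_docnames = FALSE)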
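Another possibility, not part of the approaches above, is to make the names unique before building the corpus, so that no speech is dropped, and then keep only the presidents of interest. The sketch below uses base R's make.unique() and grepl() on an invented data frame (`toy`). The rows, the " #" separator, and the pattern "Bush|Carter" are assumptions, so check the exact spellings in sotu_meta$president first; as written, the pattern would match any president whose name contains "Bush".

```r
# Hypothetical stand-in for `sotu`; the rows are invented for illustration.
toy <- data.frame(
  president = c("Jimmy Carter", "Jimmy Carter", "George Bush", "Ronald Reagan"),
  year      = c(1978, 1978, 1990, 1982),
  text      = c("first message", "second message", "third message", "fourth message"),
  stringsAsFactors = FALSE
)

# make.unique() leaves the first occurrence unchanged and appends the
# separator plus a counter (1, 2, ...) to each later repeat.
toy$docnames <- make.unique(paste(toy$president, toy$year, sep = ": "), sep = " #")

# Keep only rows whose president name matches the (assumed) pattern.
keep <- grepl("Bush|Carter", toy$president)
subset_df <- toy[keep, ]
subset_df$docnames
```

With the real data, the same two steps could go into the mutate() call and a filter() call before corpus(docid_field = "docnames", text_field = "text").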