I would like to preserve two-letter acronyms in my unigram frequency table that are separated by periods, such as "t.v." and "u.s.". When I build my unigram frequency table with quanteda, the terminating period gets truncated. Here is a small test corpus to illustrate. I have removed periods as sentence separators:
SOS This is the u.s. where our politics is crazy EOS
SOS In the US we watch a lot of t.v. aka TV EOS
SOS TV is an important part of life in the US EOS
SOS folks outside the u.s. probably don't watch so much t.v. EOS
SOS living in other countries is probably not any less crazy EOS
SOS i enjoy my sanity when it comes to visit EOS
which I load into R as a character vector:
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")
Here is the code I use to build my unigram frequency table:
library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ", toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable
This produces the following:
ngram frequency
1 SOS 6
2 EOS 6
3 the 4
4 is 3
5 . 3
6 u.s 2
7 crazy 2
8 US 2
9 watch 2
10 of 2
11 t.v 2
12 TV 2
13 in 2
14 probably 2
15 This 1
16 where 1
17 our 1
18 politics 1
19 In 1
20 we 1
21 a 1
22 lot 1
23 aka 1
etc...
I would like to keep the terminal periods on t.v. and u.s., and eliminate the entry in the table for "." with a frequency of 3.
I also don't understand why the period (".") has a count of 3 in this table while the u.s and t.v unigrams are counted correctly (2 each).
The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition for word boundaries (from the stringi package): u.s. is tokenised as the word u.s followed by a separate period token ".". This is great if your name is will.i.am but maybe not so great for your purposes. But you can easily switch to the white-space tokeniser, using the argument what = "fasterword" passed to tokens(), an option available in dfm() through the ... part of the function call.
tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS" "This" "is" "the" "u.s." "where" "our" "politics" "is" "crazy" "EOS"
You can see that here, u.s. is preserved. In response to your last question, the terminal "." had a document frequency of 3 because it appeared in three documents as a separate token, which is the default word tokeniser behaviour when remove_punct = FALSE.
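You can check that separate token directly by running the default tokeniser on the first document. This is just an illustrative sketch; the expected output shown is inferred from the ICU word-boundary behaviour described above and from the u.s and "." entries in your own frequency table:

# default ICU word tokeniser: "u.s." becomes "u.s" plus a separate "." token
tokens(acro.test, what = "word")[[1]]
## expected (consistent with the table above):
## [1] "SOS" "This" "is" "the" "u.s" "." "where" "our" "politics" "is" "crazy" "EOS"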
To pass this through to dfm() and then construct your data.frame of the document frequency of the words, the following code works (I've tidied it up a bit for efficiency). Note the comment about the difference between document and term frequency - I've noted that some users are a bit confused by docfreq().
# I removed the options that were the same as the default
# note also that stopwords is not a valid dfm() argument - see the remove argument
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")
# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
# not the same as docfreq
# dat.dfm <- sort(dat.dfm)
# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
## ngram frequency
## 1 SOS 6
## 2 EOS 6
## 3 the 4
## 4 is 3
## 5 u.s. 2
## 6 crazy 2
## 7 US 2
## 8 watch 2
## 9 of 2
## 10 t.v. 2
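If what you actually want at some point is total term frequency rather than document frequency, one way (a side note, not part of the table above) is quanteda's topfeatures(), which counts every occurrence of a feature across all documents:

# total term frequency across the corpus (not the same as docfreq)
topfeatures(dat.dfm, 10)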
In my view the named vector produced by docfreq() on the dfm is a more efficient way of storing the results than your data.frame approach, but you may wish to add other variables.
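For example, a minimal sketch of that named-vector approach (the names are the features, the values their document frequencies):

# store the document frequencies directly as a sorted named numeric vector
ng.docfreq <- sort(docfreq(dat.dfm), decreasing = TRUE)
head(ng.docfreq)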