I'm running a topic model project and using quanteda and quanteda.textstats to look at bigrams before I build the actual model. When I try tokens_compound(), I get the error:
Error in `[.data.frame`(col, col$z > 1) : undefined columns selected
However, I know that the column z exists and that it contains values greater than 1, and I'm using someone else's code that I know definitively works. Below is the code I used. I'm using the janeaustenr package for ease of reproducibility; I get the same error with that dataset as well.
library(janeaustenr)
library(tidyverse)
library(quanteda)
library(quanteda.textstats)
data(sensesensibility)
sensesensibility <- as.data.frame(sensesensibility)
## add id column to the data frame
sensesensibility <- sensesensibility %>%
  mutate(id = row_number()) %>%
  slice(13:30) ## first few lines aren't full lines of text
corpus <- corpus(sensesensibility,
                 docid_field = "id",
                 text_field = "sensesensibility")
tokens <- corpus %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE,
         remove_separators = TRUE) %>%
  tokens_tolower()
##see if it worked (it did)
tokens
col <- textstat_collocations(tokens, size = 2)
tokens.col <- tokens_compound(tokens, pattern = col[col$z > 1], concatenator = "_")
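col is a two-dimensional object, and our overclassing of it prefers two-dimensional indexing. With a single index, col[col$z > 1] is treated as a column selection, which is why the error complains about undefined columns; adding the comma, col[col$z > 1, ], selects rows instead.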
So this works:
tokens.col <- tokens_compound(tokens,
                              pattern = col[col$z > 1, ],
                              concatenator = "_")
But since you're using tidyverse, why not do it with filter()? (I've removed the unnecessary concatenator argument since "_" is the default.)
tokens2.col <- tokens_compound(tokens, filter(col, z > 1))
all.equal(tokens.col, tokens2.col)
## [1] TRUE
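To see why the single-index form fails, here is a minimal base R sketch. It assumes a small made-up data frame (df, with invented values) standing in for the collocations table, so it only illustrates plain data.frame indexing, not quanteda's internals.

```r
# Hypothetical stand-in for the collocations table: 5 rows, 3 columns
df <- data.frame(collocation = c("a b", "c d", "e f", "g h", "i j"),
                 count = c(5, 4, 3, 2, 1),
                 z = c(2.5, 0.3, 1.7, 0.9, 3.1))

keep <- df$z > 1  # logical vector with one entry per ROW (length 5)

# Single-index form: the logical vector is matched against COLUMNS.
# It has 5 entries but there are only 3 columns, so this signals an error;
# tryCatch() captures the error message as a string instead of stopping.
res <- tryCatch(df[keep], error = function(e) conditionMessage(e))
print(res)

# Two-index form: the logical vector selects ROWS and all columns are kept.
rows <- df[keep, ]
print(rows)
```

The only difference between the two calls is the trailing comma, which is what switches the logical vector from picking columns to picking rows.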