
Getting error "undefined columns selected" when using quanteda.textstats


I'm working on a topic modeling project and using quanteda and quanteda.textstats to look at bigrams before I build the actual model. When I call tokens_compound, I get the error:

Error in `[.data.frame`(col, col$z > 1) : undefined columns selected.

However, I know that the column z exists and that it contains values greater than 1, and I'm using someone else's code that I know works. Below is the code I used. I'm using the janeaustenr package for ease of reproducibility; I get the same error with that dataset as with my own data.

library(janeaustenr)
library(tidyverse)
library(quanteda)
library(quanteda.textstats)

data(sensesensibility)

sensesensibility <- as.data.frame(sensesensibility)

##add id column to the data frame
sensesensibility <- sensesensibility %>% 
  mutate(id = row_number()) %>% 
  slice(13:30) ##first few lines aren't full lines of text

corpus <- corpus(sensesensibility,
                     docid_field = "id",
                     text_field = "sensesensibility")



tokens <- corpus %>%
  tokens(remove_punct = TRUE, 
         remove_numbers = TRUE, 
         remove_symbols = TRUE, 
         remove_separators = TRUE) %>%
  tokens_tolower()

##see if it worked (it did)
tokens

col <- textstat_collocations(tokens, size = 2)

tokens.col <- tokens_compound(tokens, pattern = col[col$z > 1], concatenator = "_")


Solution

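  • col is a two-dimensional object (a data frame with extra classes), and our overclassing of it prefers two-dimensional indexing. With a single index, as in col[col$z > 1], the logical vector is treated as a column selector rather than a row selector, which is what produces "undefined columns selected". Adding a comma after the condition selects rows instead.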

    So this works:

    tokens.col <- tokens_compound(tokens, 
                                  pattern = col[col$z > 1, ],
                                  concatenator = "_")
    

    But since you're using tidyverse, why not do it with filter()? (I've removed the unnecessary concatenator argument since "_" is the default.)

    tokens2.col <- tokens_compound(tokens, filter(col, z > 1))
    
    all.equal(tokens.col, tokens2.col)
    ## [1] TRUE
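
The difference between the two indexing forms can be reproduced without quanteda. The following is a minimal sketch on a plain data frame; `df` and its values are hypothetical stand-ins for `col`, not the actual output of textstat_collocations().

```r
# Hypothetical stand-in for the collocations table: 4 rows, 2 columns
df <- data.frame(collocation = c("a b", "c d", "e f", "g h"),
                 z = c(0.5, 1.2, 2.3, 0.1))

# One index: the logical vector is read as a column selector.
# It asks for column 3, but df has only 2 columns, so this errors.
# tryCatch() captures the error message instead of stopping the script.
one_index <- tryCatch(df[df$z > 1],
                      error = function(e) conditionMessage(e))
print(one_index)

# Two indices: the logical vector selects rows; all columns are kept.
two_index <- df[df$z > 1, ]
print(two_index)
```

Note that the single-index form only fails when a TRUE falls beyond the last column; if every TRUE happened to fall within the first few positions, it would silently return columns rather than rows, which is arguably worse than an error.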