r, quanteda

Using quanteda to tokenize large datasets with limited RAM


I have a dataset of approximately 2.5 million rows of text, and I run into memory issues when I try to tokenize the whole thing at once with quanteda. My initial approach was to split the dataset into smaller subsets, tokenize each one, and then combine the resulting list of tokens objects. However, I can't get the combined object into the right form: when I use purrr::flatten, I end up with the underlying lists of integers that index into each tokens object's vector of types, rather than the actual tokens.

I would greatly appreciate any suggestions or ideas on how to solve this problem. Here's the code I've implemented so far:

library(quanteda)   # tokens(), tokens_split(), corpus_subset(), etc.
library(magrittr)   # provides the %>% pipe used below

# Tokenization function
tokenize_subset <- function(subset_corpus) {
  tokens(
    subset_corpus,
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE,
    remove_separators = TRUE,
    split_hyphens = TRUE
  ) %>%
    tokens_split(separator = "[[:digit:]]", valuetype = "regex") %>%
    tokens_tolower()
}

# Apply tokenization function to each group of "ind"
token_list <- lapply(unique(docvars(key_corpus, "ind")), function(i) {
  subset_corpus <- corpus_subset(key_corpus, subset = ind == i)
  tokenize_subset(subset_corpus)
})

token_list <- purrr::flatten(token_list)

Any suggestions on how to modify the code or alternative approaches would be highly appreciated. Thank you!


Solution

  • It's hard to know how to work around this without your dataset, or without knowing more about the length of your 2.5 million documents and your system limits (RAM).

    But you could try this: split the input file into subsets (say, 500k documents each), then load each one as a corpus, tokenise it, and save the tokens object to disk. Clear the memory, then do the next slice. At the end, clear the memory again and use c() to combine the saved tokens objects into a single tokens object (see the sketch at the end of this answer).

    Alternatively, if you can load the entire tokens object into memory, try setting

    quanteda_options(tokens_block_size = 2000)
    

    or a lower number. This effectively batches the documents and internally recompiles the integer table that tokens() uses. The default is 100000, so a lower value may help you avoid hitting memory limits.
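
    Here is a rough sketch of the slice-and-save approach, just to make it concrete. It reuses the tokenize_subset() function from your question; the input file name, the "text" column, the chunk size, and the .rds file names are all placeholders, so adjust them to your data.

    library(quanteda)

    chunk_size <- 500000                         # documents per slice; tune to your RAM
    dat <- read.csv("my_texts.csv")              # placeholder: assumes a "text" column
    dat$doc_id <- paste0("text", seq_len(nrow(dat)))  # unique doc ids so c() doesn't hit duplicated docnames
    chunk_id <- ceiling(seq_len(nrow(dat)) / chunk_size)

    token_files <- character(0)
    for (i in unique(chunk_id)) {
      slice <- dat[chunk_id == i, , drop = FALSE]
      subset_corpus <- corpus(slice, docid_field = "doc_id", text_field = "text")
      toks <- tokenize_subset(subset_corpus)     # function defined in the question
      file_i <- sprintf("tokens_%03d.rds", i)
      saveRDS(toks, file_i)
      token_files <- c(token_files, file_i)
      rm(slice, subset_corpus, toks)             # free memory before the next slice
      gc()
    }

    rm(dat)
    gc()

    # c() keeps tokens objects as tokens, unlike purrr::flatten()
    all_tokens <- do.call(c, lapply(token_files, readRDS))

    With this pattern only one slice's texts and tokens are in memory while tokenising; the final do.call(c, ...) step only needs enough RAM to hold the saved pieces and the combined tokens object.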