Tags: r, nlp, spacy, tidymodels

R tidymodels textrecipes - tokenizing with spacyr - how to remove punctuation from the produced list of tokens


I would like to tokenize my text using step_tokenize() with the spacyr engine before proceeding to lemmatisation with step_lemma(). Following that, I would like to remove, for example, punctuation from the list of tokens.

When using the default tokenizers::tokenize_words engine, you can pass such an option through the options list of step_tokenize().
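
For instance, something along these lines (a minimal sketch; strip_punct and strip_numeric are arguments of tokenizers::tokenize_words that get forwarded via the options list):

library(tidymodels)
library(textrecipes)

# Sketch with the default "tokenizers" engine: tokenizer arguments such as
# strip_punct / strip_numeric can be passed through `options`
recipe(~ text, data = tibble(text = "It was a day, Tuesday. It wasn't Thursday!")) %>%
  step_tokenize(text, options = list(strip_punct = TRUE, strip_numeric = TRUE)) %>%
  prep() %>%
  bake(new_data = NULL)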

However, my understanding is that with the spacyr engine, step_tokenize() uses spacy_parse() under the hood, which does not provide such an option.

Is there a way to remove, e.g., punctuation or numeric tokens from the tokens produced after lemmatisation with step_lemma()?

A reprex:

library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)

text = "It was a day, Tuesday. It wasn't Thursday!"

df <- tibble(text)

spacyr::spacy_initialize(entity = FALSE)

lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df %>% head(1)) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL) 

lexicon_features_tokenized_lemmatised %>% pull(text) %>% textrecipes:::get_tokens()

Output: "it", "be", "a", "day", ",", "Tuesday", ".", "it", "be", "not", "Thursday", "!"

Desired output (Removal of "!", "," and "."): "it", "be", "a", "day", "Tuesday", "it", "be", "not", "Thursday"


Solution

  • You want to use step_pos_filter() to filter the output of spaCy by part of speech (POS).

    It is a little annoying because you have to specify the tags to keep rather than the ones to drop. The full list of tags can be found here: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

    library(tidyverse)
    library(tidymodels)
    library(textrecipes)
    library(spacyr)
    
    text = "It was a day, Tuesday. It wasn't Thursday!"
    
    df <- tibble(text)
    
    spacyr::spacy_initialize(entity = FALSE)
    
    # Every POS tag from the spaCy glossary except "PUNCT",
    # so punctuation tokens are filtered out
    pos <- c("ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN", 
             "NUM", "PART", "PRON", "PROPN", "SCONJ", "SYM", "VERB", "X", "EOL", 
             "SPACE")
    
    lexicon_features_tokenized_lemmatised <-
      recipe(~ text, data = df %>% head(1)) %>%
      step_tokenize(text, engine = "spacyr") %>%
      step_pos_filter(text, keep_tags = pos) %>%
      step_lemma(text) %>%
      prep() %>%
      bake(new_data = NULL) 
    
    lexicon_features_tokenized_lemmatised %>% 
      pull(text) %>%
      textrecipes:::get_tokens()
    #> [[1]]
    #> [1] "it"       "be"       "a"        "day"      "Tuesday"  "it"       "be"      
    #> [8] "not"      "Thursday"
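
    The question also asks about numeric tokens: one option is to leave "NUM" out of keep_tags as well. A minimal sketch (pos_no_num is just an illustrative name):

    # Sketch: also exclude "NUM" so numeric tokens are filtered out too
    pos_no_num <- setdiff(pos, "NUM")

    Note that "PUNCT" is already absent from the pos vector above, which is what removes the punctuation tokens in the output.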