I would like to tokenize my text using step_tokenize() with the spacyr engine before proceeding to lemmatisation with step_lemma(). Following that, I would like to remove, for example, punctuation from the list of tokens.
When using the default tokenizers::tokenize_words, you can pass such an option through the options list of step_tokenize().
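For example, with the default engine something along these lines should work (a sketch; strip_punct and strip_numeric are arguments of tokenizers::tokenize_words that get forwarded through the options list):
recipe(~ text, data = df) %>%
  # options are passed on to tokenizers::tokenize_words()
  step_tokenize(text, options = list(strip_punct = TRUE, strip_numeric = TRUE))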
However, my understanding is that step_tokenize() uses spacyr::spacy_parse() on the backend, which does not provide such an option.
Is there a way to remove, e.g., punctuation or numeric tokens from the tokens produced after lemmatisation with step_lemma()?
A reprex:
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)
text = "It was a day, Tuesday. It wasn't Thursday!"
df <- tibble(text)
spacyr::spacy_initialize(entity = FALSE)
lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df %>% head(1)) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL)
lexicon_features_tokenized_lemmatised %>% pull(text) %>% textrecipes:::get_tokens()
Output: "it", "be", "a", "day", ",", "Tuesday", ".", "it", "be", "not", "Thursday", "!"
Desired output (removal of "!", "," and "."): "it", "be", "a", "day", "Tuesday", "it", "be", "not", "Thursday"
You want to use step_pos_filter() to filter the output of spaCy by part of speech (POS).
It is a little annoying because you have to specify the tags to keep; the full list of tags can be found here: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)
text = "It was a day, Tuesday. It wasn't Thursday!"
df <- tibble(text)
spacyr::spacy_initialize(entity = FALSE)
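# keep all POS tags except "PUNCT" so that punctuation tokens are filtered out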
pos <- c("ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN",
         "NUM", "PART", "PRON", "PROPN", "SCONJ", "SYM", "VERB", "X", "EOL",
         "SPACE")
lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df %>% head(1)) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_pos_filter(text, keep_tags = pos) %>%
  step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL)
lexicon_features_tokenized_lemmatised %>%
  pull(text) %>%
  textrecipes:::get_tokens()
#> [[1]]
#> [1] "it" "be" "a" "day" "Tuesday" "it" "be"
#> [8] "not" "Thursday"
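If you also want to drop numeric tokens, it should be enough to leave "NUM" out of the vector passed to keep_tags, since step_pos_filter() keeps only the tags you list, e.g.:
pos_no_num <- setdiff(pos, "NUM")  # illustrative: keep everything except "PUNCT" and "NUM"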