I would like to tokenize my text using step_tokenize() with the spacyr engine before proceeding to lemmatisation with step_lemma(). Following that, I would like to remove, for example, punctuation from the list of tokens.
When using the default tokenizers::tokenize_words, you can pass such an option through the options list of step_tokenize().
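For example, with the default engine something along these lines should work (a sketch; strip_punct and strip_numeric are arguments of tokenizers::tokenize_words that get forwarded through the options list):
recipe(~ text, data = df) %>%
  # options are passed on to tokenizers::tokenize_words()
  step_tokenize(text, options = list(strip_punct = TRUE, strip_numeric = TRUE))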
However, my understanding is that step_tokenize() uses spacyr::spacy_parse() on the backend, which does not provide such an option.
Is there a way to remove, e.g., punctuation or numeric tokens from the tokens produced after lemmatisation with step_lemma()?
A reprex:
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)
text = "It was a day, Tuesday. It wasn't Thursday!"
df <- tibble(text)
spacyr::spacy_initialize(entity = FALSE)
lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df %>% head(1)) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL)
lexicon_features_tokenized_lemmatised %>% pull(text) %>% textrecipes:::get_tokens()
Output: "it", "be", "a", "day", ",", "Tuesday", ".", "it", "be", "not", "Thursday", "!"
Desired output (removal of "!", "," and "."): "it", "be", "a", "day", "Tuesday", "it", "be", "not", "Thursday"
You want to use step_pos_filter() to filter the output of spaCy by part of speech (POS).
It is a little annoying because you have to specify the tags to keep; the full list of tags can be found here: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)
text = "It was a day, Tuesday. It wasn't Thursday!"
df <- tibble(text)
spacyr::spacy_initialize(entity = FALSE)
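# keep all POS tags except "PUNCT" so that punctuation tokens are filtered out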
pos <- c("ADJ", "ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "NOUN",
         "NUM", "PART", "PRON", "PROPN", "SCONJ", "SYM", "VERB", "X", "EOL",
         "SPACE")
lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df %>% head(1)) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_pos_filter(text, keep_tags = pos) %>%
  step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL)
lexicon_features_tokenized_lemmatised %>%
  pull(text) %>%
  textrecipes:::get_tokens()
#> [[1]]
#> [1] "it" "be" "a" "day" "Tuesday" "it" "be"
#> [8] "not" "Thursday"
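If you also want to drop numeric tokens, it should be enough to leave "NUM" out of the vector passed to keep_tags, since step_pos_filter() keeps only the tags you list, e.g.:
pos_no_num <- setdiff(pos, "NUM")  # illustrative: keep everything except "PUNCT" and "NUM"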