I want to remove accents and punctuation from a dictionary. For example, I want to transform "à l'épreuve" into "a l epreuve". The dictionary is this one: https://www.poltext.org/fr/donnees-et-analyses/lexicoder (.cat). There are answers for data frames (Remove accents from a dataframe column in R), but I could not find a way to do the same for dictionaries.
My code so far:
dict_lg <- dictionary(file = "frlsd/frlsd.cat", encoding = "UTF-8")
Any suggestions?
This should work:
library(quanteda)
library(stringi)
library(stringr)

dict_lg_ascii <-
  dict_lg |>
  rapply(
    f = \(term) term |>
      ## compose from string utilities as desired
      stri_trans_general(id = 'Latin-ASCII') |>
      str_replace_all(pattern = '[[:punct:]]', replacement = ' '),
    how = 'replace'
  )
output:
## > dict_lg_ascii
Dictionary object with 2 primary key entries and 2 nested levels.
- [NEGATIVE]:
- a cornes, a court de personnel , a l etroit, a peine , abais ,
## truncated
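Note the trailing spaces in the output (for example "a peine ,"). They suggest that the original entries ended in a punctuation character, presumably the * wildcard used for glob matching, which [[:punct:]] also replaces. If those wildcards should survive, one possible variation is sketched below. It is an assumption about what is wanted, not part of the code above; the helper name clean_keep_wildcard and the sample terms are made up for illustration.

```r
library(stringi)

# Illustrative helper (hypothetical name): strip accents and punctuation,
# but keep the "*" wildcard and tidy up leftover whitespace.
clean_keep_wildcard <- function(term) {
  term |>
    stri_trans_general(id = "Latin-ASCII") |>
    # the negative lookahead (?!\\*) skips "*", so "abais*" stays intact
    stri_replace_all_regex(pattern = "(?!\\*)[[:punct:]]", replacement = " ") |>
    # collapse runs of whitespace, then drop leading/trailing spaces
    stri_replace_all_regex(pattern = "\\s+", replacement = " ") |>
    stri_trim_both()
}

# Hypothetical sample entries, written with \u escapes to avoid
# source-file encoding problems (a-grave = \u00e0, e-acute = \u00e9).
terms <- c("\u00e0 l'\u00e9troit", "\u00e0 peine*", "abais*")
cleaned <- clean_keep_wildcard(terms)
print(cleaned)
```

If this behaviour is what you need, the helper could be passed to rapply in place of the anonymous function, again with how = 'replace'.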
From the quanteda docs:
Dictionaries can be subsetted using [ and [[, operating the same as the equivalent list operators.
Thus rapply (which recursively applies a function over nested lists) works. In this case, we apply stri_trans_general, as suggested here.
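To see the mechanics without loading the real dictionary, here is a minimal sketch on a plain nested list. The list toy and the helper clean_term are made up for illustration; only rapply, stri_trans_general and stri_replace_all_regex are real functions (the latter is the stringi function that stringr's str_replace_all wraps).

```r
library(stringi)

# Hypothetical stand-in for a dictionary: a nested list of character vectors.
# \u escapes are used for the accented letters (a-grave, e-acute).
toy <- list(
  NEGATIVE = list(c("\u00e0 l'\u00e9preuve", "\u00e0 peine")),
  POSITIVE = list(c("agr\u00e9able"))
)

# Illustrative helper: strip accents, then turn punctuation into spaces.
clean_term <- function(term) {
  term |>
    stri_trans_general(id = "Latin-ASCII") |>
    stri_replace_all_regex(pattern = "[[:punct:]]", replacement = " ")
}

# how = "replace" keeps the nesting and the names; only the leaves change.
toy_ascii <- rapply(toy, f = clean_term, how = "replace")
str(toy_ascii)
```

Because how = "replace" preserves the list structure, the result can be indexed exactly like the input, e.g. toy_ascii$NEGATIVE[[1]].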