I am doing text mining with R and have run into an issue I would like to solve.
To find the reports in a corpus that contain a given word or expression most often, I use the kwic function from the quanteda package like this:
result <- kwic(corp2, phrase("trous oblongs"))
where corp2 is a corpus. "trous oblongs" is French and is a plural form. When I run this, however, I only get the reports containing the expression in the plural. I would also like to count the occurrences of the singular form "trou oblong" (and vice versa: if I initially put "trou oblong" in the code, I would like to get the plural as well).
I know that the udpipe package, through its udpipe_annotate function (https://www.rdocumentation.org/packages/udpipe/versions/0.3/topics/udpipe_annotate), can extract the lemma of each word in a text.
So I would like to know whether udpipe has a function that can find all the occurrences of words sharing the same lemma in a corpus, or whether it is possible to do that with kwic.
Thanks in advance
Quanteda has tokens_wordstem(), which uses SnowballC's stemmer:
toks <- tokens(corp2)
toks_stem <- tokens_wordstem(toks, "french")
kwic(toks_stem, phrase("trous oblong"))
Alternatively, you can use the * wildcard to match every token that starts with a given stem:
toks <- tokens(corp2)
kwic(toks, phrase("trou* oblong*"))
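To make the matching more concrete, here is a minimal sketch on a toy corpus. The texts, the object names (corp_demo, kw_forms, lemmas, toks_lem, kw_lem) and the hand-written lemma table are assumptions made for illustration, not part of the original question. It shows two variations: listing both forms explicitly in the pattern, and mapping inflected forms to a lemma with tokens_replace() before searching. In practice the lemma table could probably be built from the token and lemma columns of a udpipe annotation rather than typed by hand.

```r
library(quanteda)

# Toy corpus standing in for corp2 (the texts are made up for illustration).
corp_demo <- corpus(c(
  r1 = "Le support comporte un trou oblong pour le réglage.",
  r2 = "Percer deux trous oblongs dans la platine.",
  r3 = "Aucun perçage n'est prévu sur cette pièce."
))
toks <- tokens(corp_demo, remove_punct = TRUE)

# Option 1: list both forms explicitly. This avoids over-matching,
# since a glob such as "trou*" would also match e.g. "troupe".
kw_forms <- kwic(toks, pattern = phrase(c("trou oblong", "trous oblongs")))

# Option 2: map inflected forms to a lemma, then search for the lemma.
# The lemma table is hand-written here (an assumption for this sketch).
lemmas <- c(trous = "trou", oblongs = "oblong")
toks_lem <- tokens_replace(toks,
                           pattern = names(lemmas),
                           replacement = unname(lemmas),
                           valuetype = "fixed")
kw_lem <- kwic(toks_lem, pattern = phrase("trou oblong"))

# Count matches per report to see which reports use the expression most.
table(kw_lem$docname)
```

Both options should return the singular and the plural occurrence in this toy example. The explicit list is the simplest when only a few forms exist, while the lemma mapping scales better if many expressions need to be searched.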