r text-mining quanteda udpipe

Find words in a corpus based on lemma


I am doing text mining with R and have run into an "issue" I would like to solve. To find the reports in a corpus that contain a given word or expression most often, I use the kwic function from the quanteda package like this:

result <- kwic(corp2, phrase("trous oblongs"))

where corp2 is a corpus. "trous oblongs" is French and is a plural form. When I do this, however, I only get the reports that contain the expression in the plural. I would also like to count the occurrences of the singular form "trou oblong" (and vice versa: if I put "trou oblong" in the code, I would like to get the plural as well).

I know that the udpipe package, through its udpipe_annotate function (https://www.rdocumentation.org/packages/udpipe/versions/0.3/topics/udpipe_annotate), is able to extract the lemma of each word in a text.

So I would like to know whether udpipe has a function that can find all the occurrences of words sharing the same lemma in a corpus, or whether it is possible to do that with kwic.

Thanks in advance


Solution

  • quanteda has tokens_wordstem(), which uses SnowballC's stemmer:

    toks <- tokens(corp2)
    toks_stem <- tokens_wordstem(toks, "french")
    kwic(toks_stem, phrase("trous oblong"))
    

    Alternatively, you can use the * wildcard to search for stems (kwic matches patterns as globs by default):

    toks <- tokens(corp2)
    kwic(toks, phrase("trou* oblong*"))
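
Since the question asks about lemmas rather than stems, here is a minimal sketch of a lemma-based variant using quanteda's tokens_replace(). The two toy documents and the lemma_table data frame are made up for illustration; in practice the table could presumably be built from the token and lemma columns returned by udpipe_annotate(), which is an assumption not tested here.

```r
library(quanteda)

# Two toy documents (invented for this example): one plural, one singular.
corp <- corpus(c(
  doc1 = "Les trous oblongs sont alignés.",
  doc2 = "Un trou oblong suffit ici."
))
toks <- tokens(corp, remove_punct = TRUE)

# Hypothetical hand-written lemma table (inflected form -> lemma).
lemma_table <- data.frame(
  token = c("trous", "oblongs"),
  lemma = c("trou", "oblong"),
  stringsAsFactors = FALSE
)

# Replace each inflected form with its lemma, matching tokens exactly.
toks_lemma <- tokens_replace(
  toks,
  pattern = lemma_table$token,
  replacement = lemma_table$lemma,
  valuetype = "fixed"
)

# Search for the lemmatised phrase; both documents should now match.
result <- kwic(toks_lemma, pattern = phrase("trou oblong"))
print(result)
```

Note that the keyword shown in the kwic output is the lemmatised form, not the original surface form, because the search runs on the replaced tokens.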