Tags: r, nlp, tm, corpus, qdap

How to filter out all short strings (2 characters or fewer) in a corpus?


Given a simple string:

t <- "hello world ww ff a wr gj dkjffdkn kuku"

library(tm)
VCorpus(VectorSource(t))

I want to filter out all words that are 2 characters long or shorter. How can I do this using the qdap or tm packages? I know I can use a regex for this, but is there a function that does it?


Solution

  • With the package qdapRegex, you can do:

    library(qdapRegex)
    rm_nchar_words(t, "1,2")
    
    [1] "hello world dkjffdkn kuku"
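
The question asks about a corpus, while the call above works on a plain string. As a rough sketch of how the same filtering could be applied to each document, here is a base-R equivalent wrapped for tm_map. The helper name drop_short_words and its max_len argument are made up for illustration and are not part of tm, qdap, or qdapRegex. The regex treats punctuation as a word boundary, so it may behave differently from rm_nchar_words on text with apostrophes or hyphens.

```r
# Hypothetical helper (not from any package): remove words of 1..max_len
# characters, then collapse the leftover whitespace.
drop_short_words <- function(x, max_len = 2) {
  pattern <- sprintf("\\b\\w{1,%d}\\b", max_len)
  x <- gsub(pattern, "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

t <- "hello world ww ff a wr gj dkjffdkn kuku"
drop_short_words(t)
# [1] "hello world dkjffdkn kuku"

# Applying it to every document in a tm corpus (runs only if tm is installed).
if (requireNamespace("tm", quietly = TRUE)) {
  corpus  <- tm::VCorpus(tm::VectorSource(t))
  cleaned <- tm::tm_map(corpus, tm::content_transformer(drop_short_words))
  print(as.character(cleaned[[1]]))
}
```

If qdapRegex is available, passing rm_nchar_words to content_transformer in place of the helper (with "1,2" as an extra argument to tm_map) should work the same way, since tm_map forwards additional arguments to the function.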