The textcnt function in R's tau package has a split argument whose default value is split = "[[:space:][:punct:][:digit:]]+". This default also splits words on the apostrophe ', which I don't want. How can I modify the argument so that the apostrophe is not used to split words?
This code:
`library(tau)`
`text <- "I don't want the function to use the ' to split"`
`textcnt(text, split = "[[:space:][:punct:][:digit:]]+", method = "string", n = 1L)`
produces this output:
don function i split t the to use want
1 1 1 1 1 2 2 1 1
Instead of getting don 1 and t 1, I would like to keep don't as one word.
I have tried using str_replace_all from stringr to remove the punctuation beforehand and then omitting the [:punct:] part of the split argument in textcnt, but then symbols such as &, > or " are no longer used to split. I have also tried modifying the split argument directly, but then it either doesn't split the sentence at all or it keeps the symbols.
Thank you
With PCRE-based functions you need to use
split = "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"
Here,
- `(?:` - start of a container non-capturing group:
- `(?!')` - fail the match if the next char is a `'` char
- `[[:space:][:punct:][:digit:]]` - matches a whitespace, punctuation or digit char
- `)+` - end of the group, matched one or more times (consecutively)
- `|` - or
- `'\B` - a `'` char that is followed by either end of string or a non-word char
- `|` - or
- `\B'` - a `'` char that is preceded by either start of string or a non-word char.

With stringr functions, you can use
split = "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"
Here, `[[:space:][:punct:][:digit:]--[']]` matches all characters matched by `[[:space:][:punct:][:digit:]]` except the `'` chars. The ICU regex flavor used by stringr supports character class subtraction using this notation.
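As a quick check of the two patterns above, here is a minimal sketch. It uses base R's strsplit with perl = TRUE as a stand-in for a PCRE-based splitter (an assumption for illustration, not a claim about how textcnt works internally), and the tau and stringr calls only run if those packages are installed. The names pcre_split, icu_split and tokens are made up for this example.

```r
# Sample sentence from the question.
text <- "I don't want the function to use the ' to split"

# PCRE pattern: split on whitespace/punctuation/digits other than ', and on
# any ' that is not sitting between two word characters.
pcre_split <- "(?:(?!')[[:space:][:punct:][:digit:]])+|'\\B|\\B'"

# Base R stand-in for a PCRE-based splitter (illustrative assumption).
tokens <- strsplit(text, pcre_split, perl = TRUE)[[1]]
tokens <- tokens[nzchar(tokens)]  # drop empty pieces left between adjacent separators
print(tokens)

# The same pattern passed to textcnt, as in the question (needs the tau package).
if (requireNamespace("tau", quietly = TRUE)) {
  print(tau::textcnt(text, split = pcre_split, method = "string", n = 1L))
}

# ICU pattern with character class subtraction, for stringr functions.
icu_split <- "[[:space:][:punct:][:digit:]--[']]+|'\\B|\\B'"
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_split(text, icu_split)[[1]])
}
```

With the base R stand-in, don't should come out as a single token, while the standalone ' is treated as a separator and dropped.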