rregexpcrecharacter-replacement

Can quantifiers be used in regex replacement in R?


My objective would be replacing a string by a symbol repeated as many characters as have the string, in a way as one can replace letters to capital letters with \\U\\1, if my pattern was "...(*)..." my replacement for what is captured by (*) would be something like x\\q1 or {\\q1}x so I would get so many x as characters captured by *.

Is this possible?

I am thinking mainly in sub,gsub but you can answer with other libraris like stringi,stringr, etc. You can use perl = TRUE or perl = FALSE and any other options with convenience.

I assume the answer can be negative, since seems to be quite limited options (?gsub):

a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA. 

Main quantifiers are (?base::regex):

?

    The preceding item is optional and will be matched at most once.
*

    The preceding item will be matched zero or more times.
+

    The preceding item will be matched one or more times.
{n}

    The preceding item is matched exactly n times.
{n,}

    The preceding item is matched n or more times.
{n,m}

    The preceding item is matched at least n times, but not more than m times.

Ok, but it seems to be an option (which is not in PCRE, not sure if in PERL or where...) (*) which captures the number of characters the star quantifier is able to match (I found it at https://www.rexegg.com/regex-quantifier-capture.html) so then it could be used \q1 (same reference) to refer to the first captured quantifier (and \q2, etc.). I also read that (*) is equivalent to {0,} but I'm not sure if this is really the fact for what I'm interested in.

EDIT UPDATE:

Since asked by commenters I update my question with an specific example provide by this interesting question. I modify a bit the example. Let's say we have a <- "I hate extra spaces elephant" so we are interested in keeping the a unique space between words, the 5 first characters of each word (till here as the original question) but then a dot for each other character (not sure if this is what is expected in the original question but doesn't matter) so the resulting string would be "I hate extra space. eleph..." (one . for the last s in spaces and 3 dots for the 3 letters ant in the end of elephant). So I started by keeping the 5 first characters with

gsub("(?<!\\S)(\\S{5})\\S*", "\\1", a, perl = TRUE)
[1] "I hate extra space eleph"

How should I replace the exact number of characters in \\S* by dots or any other symbol?


Solution

  • Quantifiers cannot be used in the replacement pattern, nor the information how many chars they match.

    What you need is a \G base PCRE pattern to find consecutive matches after a specific place in the string:

    a <- "I hate extra spaces elephant"
    gsub("(?:\\G(?!^)|(?<!\\S)\\S{5})\\K\\S", ".", a, perl = TRUE)
    

    See the R demo and the regex demo.

    Details