java regex regex-lookarounds

Capture stream of digits which is not followed by certain digits


I want to capture a run of digits that is not followed by certain digits. For example:

input = abcdef lookbehind 123456..... asjdnasdh lookbehind 789432

I want to capture 789432 and not 123 using negative lookahead only.

I tried (?<=lookbehind )([\d])+(?!456) but it captures 123456 and 789432.

Using (?<=lookbehind )([\d])+?(?!456) captures only 1 and 7.

Grouping is not an option for me as my use case doesn't allow me to do it.

Is there any way I can capture 789432 and not 123 using pure regex? An explanation for the answer is appreciated.


Solution

  • You may use a possessive quantifier with a negative lookbehind:

    (?<=lookbehind )\d++(?<!456)
                      ^^ ^^^^^^ 
    

    See this regex demo.

    A synonymous pattern with an atomic group:

    (?<=lookbehind )(?>\d+)(?<!456)
    
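Both patterns can be checked with a minimal Java sketch. The class name `PossessiveLookbehindDemo` and the helper `findAll` are made up for illustration; only `java.util.regex.Pattern` and `Matcher` come from the standard library, and the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveLookbehindDemo {
    // Hypothetical helper: collects every full match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group()); // whole match, no capturing group needed
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432";

        // Possessive quantifier: \d++ never gives digits back.
        String possessive = "(?<=lookbehind )\\d++(?<!456)";
        // Atomic group: same effect, written as (?>\d+).
        String atomic = "(?<=lookbehind )(?>\\d+)(?<!456)";

        System.out.println(findAll(possessive, input)); // [789432]
        System.out.println(findAll(atomic, input));     // [789432]
    }
}
```

Note that the backslash has to be doubled inside a Java string literal, so `\d++` is written as `"\\d++"`.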

    Details

    Why lookbehind and why not lookahead

    The negative lookbehind (?<!...) fails the match if its pattern matches immediately to the left of the current location. A negative lookahead (?!...) fails the match if its pattern matches immediately to the right of the current location. "Fail" here means that the regex engine abandons the current way of matching the string; if there are quantified patterns before the lookbehind/lookahead, the engine may backtrack into those patterns and try to match the string differently. Here, the possessive quantifier prevents the engine from repeating the lookbehind check for 456 at shorter and shorter matches: the check runs a single time, after \d++ has grabbed all the digits.

    Your (?<=lookbehind )([\d])+(?!456) regex matches 123456 because \d+ matches these digits greedily (all at once) and (?!456) then checks for 456 after them; since there is no 456 at that position, the match is returned. The (?<=lookbehind )([\d])+?(?!456) regex matches only one digit because \d+? matches lazily: one digit is matched and then the lookahead check is performed. Since there is no 456 right after 1, 1 is returned.
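The behaviour of the two attempts from the question can be reproduced with a short Java sketch. The class name `LookaheadAttemptsDemo` and the helper `findAll` are illustrative only; the patterns and the input are taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadAttemptsDemo {
    // Hypothetical helper: collects every full match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432";

        // Greedy: all six digits are consumed first, then (?!456) looks at ".....".
        String greedy = "(?<=lookbehind )([\\d])+(?!456)";
        // Lazy: one digit is consumed, then (?!456) looks at "234" (or "894").
        String lazy = "(?<=lookbehind )([\\d])+?(?!456)";

        System.out.println(findAll(greedy, input)); // [123456, 789432]
        System.out.println(findAll(lazy, input));   // [1, 7]
    }
}
```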

    Why the ++ possessive quantifier

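    It does not allow the regex engine to backtrack into the quantified pattern and retry matching the string differently. Without it, (?<=lookbehind )\d+(?<!456) matches 12345 in 123456: after \d+ grabs 123456 the lookbehind fails, the engine gives back the 6, and at that position the three preceding characters are 345, not 456.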
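The difference between the plain greedy quantifier and the possessive one can be seen side by side in the following Java sketch. The class name `BacktrackingDemo` and the helper `findAll` are hypothetical; the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackingDemo {
    // Hypothetical helper: collects every full match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "abcdef lookbehind 123456..... asjdnasdh lookbehind 789432";

        // \d+ may backtrack: it gives back the 6, and "345" passes (?<!456).
        String backtracking = "(?<=lookbehind )\\d+(?<!456)";
        // \d++ may not backtrack: the whole attempt at 123456 fails.
        String possessive = "(?<=lookbehind )\\d++(?<!456)";

        System.out.println(findAll(backtracking, input)); // [12345, 789432]
        System.out.println(findAll(possessive, input));   // [789432]
    }
}
```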