rubyregexruby-1.9.3

Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?


I have gone through the docs for Atomic Grouping and rubyinfo and some questions came into my mind:

  1. Why the name "Atomic grouping"? What "atomicity" does it have that general grouping doesn't?
  2. How does atomic grouping differ to general grouping?
  3. Why are atomic groups called non-capturing groups?

I tried the below code to understand but had confusion about the output and how differently they work on the same string as well?

irb(main):001:0> /a(?>bc|b)c/ =~ "abbcdabcc"
=> 5
irb(main):004:0> $~
=> #<MatchData "abcc">
irb(main):005:0> /a(bc|b)c/ =~ "abcdabcc"
=> 0
irb(main):006:0> $~
=> #<MatchData "abc" 1:"b">

Solution

  • A () has some properties (include those such as (?!pattern), (?=pattern), etc. and the plain (pattern)), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.

    The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern) since it doesn't store the start and end index of the string matching the pattern inside.

    Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured.

    Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).

    A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.

    Example

    Input string: bbabbbabbbbc
    Pattern: /(?>.*)c/

    The first match by .* is bbabbbabbbbc due to the greedy quantifier *. It will hold on to this match, disallowing c from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.


    Input string: bbabbbabbbbc
    Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/

    There are 3 matches to this regex, which are bba, bbba, bbbbc. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b* inside.

    You can see the backtracking behavior here.