
Using set operators with the Python regex module


I'm having trouble getting set operators to work in the regex module (regex 2013-11-29) on Python 3.x. For example, to match ASCII characters except punctuation, I tried:

import regex as rx

data = '(foo)'
for m in rx.finditer(r'[\p{ASCII}--\p{P}]+', data):
    print(m.group(0))     # expect 'foo', getting '(foo)'

The documentation gives this example:

[\p{N}--[0-9]] # Set containing all numbers except '0' .. '9'

Am I missing something here?


Solution

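  • You need to opt in to Version 1 behaviour explicitly, so that -- is parsed as the set-difference operator rather than as literal hyphens inside the class. Under the default Version 0 behaviour the pattern is read as a plain union of \p{ASCII}, '-' and \p{P}, which is why '(' and ')' still match.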

    From the module web page:

    Version 1 behaviour (new behaviour, different from the current re module):

    Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.

    • .split will split a string at a zero-width match.

    • Inline flags apply to the end of the group or pattern, and they can be turned off.

    • Nested sets and set operations are supported.

    • Case-insensitive matches in Unicode use full case-folding by default.

    If no version is specified, the regex module will default to regex.DEFAULT_VERSION. In the short term this will be VERSION0, but in the longer term it will be VERSION1.
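
Applied to the pattern from the question, the two ways of opting in quoted above (the V1 flag and the inline (?V1) marker) might look like the sketch below. It assumes the third-party regex package is installed; the variable names with_flag and with_inline are only illustrative.

```python
import regex as rx  # third-party "regex" package, not the stdlib "re"

data = '(foo)'
pattern = r'[\p{ASCII}--\p{P}]+'

# Option 1: pass the V1 flag (rx.VERSION1 is the long spelling).
with_flag = [m.group(0) for m in rx.finditer(pattern, data, flags=rx.V1)]

# Option 2: switch to Version 1 inline, at the start of the pattern.
with_inline = [m.group(0) for m in rx.finditer(r'(?V1)' + pattern, data)]

print(with_flag)    # ['foo']
print(with_inline)  # ['foo']
```

The inline form is handy when the pattern is stored separately from the call site, since the version then travels with the pattern itself.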